XPath: Select following siblings until certain class

I have the following html snippet:
<table>
<tr>
<td class="foo">a</td>
<td class="bar">1</td>
<td class="bar">2</td>
<td class="foo">b</td>
<td class="bar">3</td>
<td class="bar">4</td>
<td class="bar">5</td>
<td class="foo">c</td>
<td class="bar">6</td>
<td class="bar">7</td>
</tr>
</table>
I'm looking for an XPath 1.0 expression that starts at a .foo element and selects all following .bar elements before the next .foo element.
For example: I start at a and want to select only 1 and 2.
Or I start at b and want to select 3, 4 and 5.
Background: I have to find an XPath expression for this method (using Java and Selenium):
public List<WebElement> bar(WebElement foo) {
return foo.findElements(By.xpath("./following-sibling::td[@class='bar']..."));
}
Is there a way to solve the problem?
The expression should work for all .foo elements without using any external variables.
Thanks for your help!
Update: There is apparently no solution for these special circumstances. But if you have fewer limitations, the provided expressions work perfectly.

Good question!
The following expression will give you 1..2, 3..5 or 6..7, depending on the index you supply. The index is X + 1, where X is the set you want ([2] gives 1..2, [3] gives 3..5, etc.). In the example, I select the third set, hence the [4]:
/table/tr[1]
/td[not(@class = 'foo')]
[
generate-id(../td[@class='foo'][4])
= generate-id(
preceding-sibling::td[@class='foo'][1]
/following-sibling::td[@class='foo'][1])
]
The beauty of this expression (imnsho) is that you can index by the given set (as opposed to indexing by relative position) and that it has only one place where you need to update the expression. If you want the sixth set, just type [7].
This expression works for any situation where you need the siblings between any two nodes that meet the same requirement (@class = 'foo'). I'll update with an explanation.
Replace the [4] in the expression with whatever set you need, plus 1. In oXygen, the above expression selects nodes 6 and 7.
Explanation
/table/tr[1]
Selects the first tr.
/td[not(@class = 'foo')]
Selects any td that is not a foo.
generate-id(../td[@class='foo'][4])
Gets the identity of the xth foo. In this case, this selects nothing and returns an empty string. In all other cases, it will return the identity of the next foo that we are interested in.
generate-id(
preceding-sibling::td[@class='foo'][1]
/following-sibling::td[@class='foo'][1])
Gets the identity of the first previous foo (counting backward from any non-foo element) and from there, the first following foo. In the case of node 7, this returns the identity of nothingness, resulting in true for our example case of [4]. In the case of node 3, this will result in c, which is not equal to nothingness, resulting in false.
If the example had the value [2], this last bit would return node b for nodes 1 and 2, which is equal to the identity of ../td[@class='foo'][2], returning true. For nodes 4 and 7 etc., this will return false.
Update, alternative #1
We can replace the generate-id function with a count over the preceding-sibling axis. Since the number of siblings before each of the two foo nodes is different, this works as an alternative to generate-id.
By now it starts to grow just as unwieldy as GSerg's answer, though:
/table/tr[1]
/td[not(@class = 'foo')]
[
count(../td[@class='foo'][4]/preceding-sibling::*)
= count(
preceding-sibling::td[@class='foo'][1]
/following-sibling::td[@class='foo'][1]/preceding-sibling::*)
]
The same "indexing" method applies. Where I write [4] above, replace it with the nth + 1 of the intersection position you are interested in.
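For comparison, the "index by set" idea can also be sketched outside XPath. This is only an illustration under assumptions: it is Python using the standard library's xml.etree.ElementTree (whose XPath subset has no sibling axes), and the helper name bars_in_group is hypothetical. It opens a new group at each foo cell and returns the n-th group, counting from 1, so no "+ 1" is needed here.

```python
import xml.etree.ElementTree as ET

ROW = """
<tr>
  <td class="foo">a</td><td class="bar">1</td><td class="bar">2</td>
  <td class="foo">b</td><td class="bar">3</td><td class="bar">4</td><td class="bar">5</td>
  <td class="foo">c</td><td class="bar">6</td><td class="bar">7</td>
</tr>
"""

def bars_in_group(row, n):
    """Return the bar cells that belong to the n-th foo cell (1-based)."""
    groups = []
    for td in row.findall("td"):
        if td.get("class") == "foo":
            groups.append([])          # a foo cell opens a new group
        elif groups and td.get("class") == "bar":
            groups[-1].append(td)      # bar cells join the most recent group
    return groups[n - 1] if 0 < n <= len(groups) else []

row = ET.fromstring(ROW)
print([td.text for td in bars_in_group(row, 3)])
```

Unlike the XPath above, the last group needs no special handling, because a group simply ends when the row does.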

If the current node is one of the td[@class='foo'] elements, you can use the XPath below to get the following td[@class='bar'] elements that precede the next foo td:
following-sibling::td[@class='bar'][generate-id(preceding-sibling::td[@class='foo'][1]) = generate-id(current())]
Here, you select only those td[@class='bar'] whose first preceding td[@class='foo'] is the same as the current node you are iterating on (confirmed using generate-id()). Note that current() is an XSLT function, so this works in an XSLT context rather than in plain XPath 1.0.

So you want an intersection of two sets:
following-sibling::td[@class='bar'] that follow your starting td[@class='foo'] node
preceding-sibling::td[@class='bar'] that precede the next td[@class='foo'] node
Given the formula from the linked question, it is not difficult to get:
//td[1]/following-sibling::td[@class='bar'][count(. | (//td[1]/following-sibling::td[@class='foo'])[1]/preceding-sibling::td[@class='bar']) = count((//td[1]/following-sibling::td[@class='foo'])[1]/preceding-sibling::td[@class='bar'])]
However this will return an empty set for the last foo node because there is no next foo node to take precedings from.
So you want a difference of two sets:
following-sibling::td[@class='bar'] that follow your starting td[@class='foo'] node
following-sibling::td[@class='bar'] that follow the next td[@class='foo'] node
Given the formula from the linked question, it is not difficult to get:
//td[1]/following-sibling::td[@class='bar'][
count(. | (//td[1]/following-sibling::td[@class='foo'])[1]/following-sibling::td[@class='bar'])
!=
count((//td[1]/following-sibling::td[@class='foo'])[1]/following-sibling::td[@class='bar'])
]
]
The only amendable bit is the starting point, //td[1] (three times).
Now this will properly return bar nodes even for the last foo node.
The above was written under the impression that you need to have a single XPath query and nothing more. Now that it's clear you don't, you can easily solve your problem with more than one XPath query and some manual list filtering on referential equality, as I already mentioned in a comment.
In C# that would be:
XmlNode context = xmlDocument.SelectSingleNode("//td[8]");
XmlNode nextFoo = context.SelectSingleNode("(./following-sibling::td[@class='foo'])[1]");
IEnumerable<XmlNode> result = context.SelectNodes("./following-sibling::td[@class='bar']").Cast<XmlNode>();
if (nextFoo != null)
{
    // Intersect filters using referential equality by default
    result = result.Intersect(nextFoo.SelectNodes("./preceding-sibling::td[@class='bar']").Cast<XmlNode>());
}
I'm sure it's trivial to convert to Java.
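As a rough sketch of the same "start at one foo, stop at the next foo" filtering in another host language, here is a Python version. It is an illustration only, not a port of the C# above: it assumes the standard library's xml.etree.ElementTree, it walks the siblings in a loop instead of intersecting two node lists, and the helper name bars_after is made up. ElementTree elements have no parent pointer, so the row is passed in explicitly.

```python
import xml.etree.ElementTree as ET

ROW = """
<tr>
  <td class="foo">a</td><td class="bar">1</td><td class="bar">2</td>
  <td class="foo">b</td><td class="bar">3</td><td class="bar">4</td><td class="bar">5</td>
  <td class="foo">c</td><td class="bar">6</td><td class="bar">7</td>
</tr>
"""

def bars_after(row, foo):
    """Collect the bar siblings after `foo`, stopping at the next foo."""
    cells = list(row)                  # children of the row, in document order
    start = cells.index(foo) + 1       # first sibling after the starting cell
    result = []
    for td in cells[start:]:
        if td.get("class") == "foo":   # the next foo ends the run
            break
        if td.get("class") == "bar":
            result.append(td)
    return result

row = ET.fromstring(ROW)
foos = row.findall("td[@class='foo']")
for foo in foos:
    print(foo.text, [td.text for td in bars_after(row, foo)])
```

The loop also covers the last foo, where there is no next foo to stop at.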

Pretty straightforward (example for 'a' td) but not very optimal:
//td[
@class='bar' and
preceding-sibling::td[@class='foo'][1][text() = 'a'] and
(
not(following-sibling::td[@class='foo']) or
following-sibling::td[@class='foo'][1][preceding-sibling::td[@class='foo'][1][text() = 'a']]
)
]

Related

How can I locate items using xpath from below elements?

I've created some XPath expressions to locate the items by their index after the h4, but they don't work at all. I'd appreciate it if someone could take a look and suggest a fix.
I tried with:
//div[@id="schoolDetail"][1]/text() --For the Name
//div[@id="schoolDetail"]//br[0]/text() --For the PO Box
The element containing the items I want to locate is pasted below:
<div id="schoolDetail" style=""><h4>School Detail: Click here to go back to list</h4> GOLD DUST FLYING SERVICE, INC.<br>PO Box 75<br><br>TALLADEGA AL 36260<br> <br>Airport: TALLADEGA MUNICIPAL (ASN)<br>Manager: JEAN WAGNON<br>Phone: 2563620895<br>Email: golddustflyingse@bellsouth.net<br>Web: <br><br>View in AOPA Airports (Opens in new tab) <br><br></div>
By the way, the resulting values should be:
GOLD DUST FLYING SERVICE, INC.
PO Box 75
Try to locate required text nodes by appropriate index:
//div[@id="schoolDetail"]/text()[1] // For "GOLD DUST FLYING SERVICE, INC."
//div[@id="schoolDetail"]/text()[2] // For "PO Box 75"
Locator to get both elements:
//*[@id='schoolDetail']/text()[position()<3]
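To see what "the n-th text node of the div" means in practice, here is a small Python sketch. It rests on several assumptions: it uses the standard library's xml.etree.ElementTree rather than Selenium, the markup is a shortened copy of the div with the br tags closed so that it parses as XML, and the helper name direct_text_nodes is hypothetical. ElementTree stores text nodes as .text and .tail instead of exposing text(). The helper also drops whitespace-only nodes, which XPath's text() would count.

```python
import xml.etree.ElementTree as ET

DIV = ("<div id='schoolDetail'><h4>School Detail</h4> GOLD DUST FLYING SERVICE, INC."
       "<br/>PO Box 75<br/><br/>TALLADEGA AL 36260<br/></div>")

def direct_text_nodes(elem):
    """Return the non-blank text nodes directly inside elem, in document order."""
    # Text before the first child lives in elem.text; text after each
    # child lives in that child's .tail.
    pieces = [elem.text] + [child.tail for child in elem]
    return [p.strip() for p in pieces if p and p.strip()]

div = ET.fromstring(DIV)
texts = direct_text_nodes(div)
print(texts[0])   # counterpart of text()[1]
print(texts[1])   # counterpart of text()[2]
```

Python lists are 0-based while XPath positions are 1-based, hence texts[0] for text()[1].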
Explanation:
[x] - XPath filters nodes using a predicate in square brackets.
If x is a number, it is automatically compared with the element's position, as in [position()=x]:
//div[2] - searches for the 2nd div, the same as div[position()=2]
If the predicate x is not a number, it is converted to a boolean value, and only the elements for which x is true are returned, for example:
div[position() <= 4] - searches for the first four div elements; from the 5th element onwards the position is greater than 4, so the predicate is false
Important: please check the following locators on this page:
https://www.w3schools.com/tags/ref_httpmessages.asp
//table//tr[1] - will return the 1st row of each table! (12 elements found, the same as the number of tables on the page)
(//table//tr)[1] - will return the 1st row of the first table found (1 element found)
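The difference between the two locators can be reproduced on a tiny document. This is a sketch under assumptions: it uses Python's standard xml.etree.ElementTree, whose limited XPath subset does not accept the parenthesized form, so the second locator is emulated by indexing the full result list. The two-table markup is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

PAGE = """
<body>
  <table><tr><td>t1 r1</td></tr><tr><td>t1 r2</td></tr></table>
  <table><tr><td>t2 r1</td></tr><tr><td>t2 r2</td></tr></table>
</body>
"""
root = ET.fromstring(PAGE)

# Like //table//tr[1]: the position is tested against each row's own parent,
# so every table contributes its first row.
first_in_each = root.findall(".//table//tr[1]")

# Like (//table//tr)[1]: collect every row first, then index the whole list.
first_overall = root.findall(".//table//tr")[0]

print([tr[0].text for tr in first_in_each])
print(first_overall[0].text)
```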

Select all nodes until a specific given node/tag

Given the following markup:
<div id="about">
<dl>
<dt>Date</dt>
<dd>1872</dd>
<dt>Names</dt>
<dd>A</dd>
<dd>B</dd>
<dd>C</dd>
<dt>Status</dt>
<dd>on</dd>
<dt>Another Field</dt>
<dd>X</dd>
<dd>Y</dd>
</dl>
</div>
I'm trying to extract all the <dd> nodes following <dt>Names</dt> but only until another <dt> starts. In this case, I'm after the following nodes:
<dd>A</dd>
<dd>B</dd>
<dd>C</dd>
I'm trying the following XPath code, but it's not working as intended.
xpath("//div[@id='about']/dl/dt[contains(text(),'Names')]/following-sibling::dd[not(following-sibling::dt)]/text()")
Any thoughts on how to fix it?
Many thanks.
Update: much simpler solution
There is a prerequisite in your situation, that is that the anchor item always is the first preceding sibling with a certain property. Because of that, here's a much simpler way of writing the below complex expression:
/div/dl/dd[preceding-sibling::dt[1][. = 'Names']]
In other words:
select any dd
that has a first preceding sibling dt (the preceding sibling axis counts backwards)
that itself has a value of "Names"
In oXygen, it selects the nodes you wanted to select (and if you change "Names" to "Status" or "Another Field", it will select only the dd elements that follow that dt, up to the next dt).
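The "nearest preceding dt" rule can also be sketched procedurally. This is only an illustration, assuming Python's standard xml.etree.ElementTree (which has no preceding-sibling axis) and a hypothetical helper name, dd_values. It walks the children of the dl once and remembers the most recent dt.

```python
import xml.etree.ElementTree as ET

DL = """
<dl>
  <dt>Date</dt><dd>1872</dd>
  <dt>Names</dt><dd>A</dd><dd>B</dd><dd>C</dd>
  <dt>Status</dt><dd>on</dd>
  <dt>Another Field</dt><dd>X</dd><dd>Y</dd>
</dl>
"""

def dd_values(dl, label):
    """Return the text of each dd whose nearest preceding dt equals `label`."""
    current = None
    values = []
    for child in dl:
        if child.tag == "dt":
            current = child.text       # remember the most recent dt
        elif child.tag == "dd" and current == label:
            values.append(child.text)
    return values

dl = ET.fromstring(DL)
print(dd_values(dl, "Names"))
```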
Original complex solution (leaving in for reference)
This is far easier in XPath 2.0, but let's assume you can only use XPath 1.0. The trick is to count the number of preceding siblings from your anchor element (the one with "Names" in it), and disregard any that have the wrong count (i.e., when we cross over <dt>Status</dt>, the number of preceding siblings has increased).
For XPath 1.0, remove the comments between (: and :) (in XPath, whitespace is insignificant, you can make it a multiline XPath for readability, but in 1.0, comments are not possible)
/div/dl/dd
(: any dd having a dt before it with "Names" :)
[preceding-sibling::dt[. = 'Names']]
(: count the preceding siblings up to dt with "Names", add one to include 'self' :)
[count(preceding-sibling::dt[. = 'Names']/preceding-sibling::dt) + 1
=
(: compare with count of all preceding siblings :)
count(preceding-sibling::dt)]
As a one-liner:
/div/dl/dd[preceding-sibling::dt[. = 'Names']][count(preceding-sibling::dt[. = 'Names']/preceding-sibling::dt) + 1 = count(preceding-sibling::dt)]
How about this:
//dd[preceding-sibling::dt[contains(., 'Names')]][following-sibling::dt]

How to return X elements [Selenium]?

A page loads 35,000 elements, of which only the first 10 are of interest to me. Returning all elements makes the scraping extremely slow.
I only succeeded in either returning the first element with:
driver.find_element_by
Or returning all 35,000 elements with:
driver.find_elements_by
Does anyone know a way to return only the first x elements found?
Selenium does not provide a facility for returning only a slice of what the .find_elements... calls match. A general solution, if you want to avoid having Selenium return every single element, is to perform the slice operation on the browser side, in JavaScript. I present this solution in this answer. If you want to use XPath for selecting the DOM nodes, you could adapt this answer to that, or you could use the method in another answer I've submitted.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.example.com")
# We add 35000 paragraphs with class `test` to the page so that we can
# later show how to get the first 10 paragraphs of this class. Each
# paragraph is uniquely numbered.
driver.execute_script("""
var html = [];
for (var i = 0; i < 35000; ++i) {
    html.push("<p class='test'>"+ i + "</p>");
}
document.body.innerHTML += html.join("");
""")
elements = driver.execute_script("""
return Array.prototype.slice.call(document.querySelectorAll("p.test"), 0, 10);
""")
# Verify that we got the first 10 elements by outputting the text they
# contain to the console. The loop here is for illustration purposes
# to show that the `elements` array contains what we want. In real
# code, if I wanted to process the text of the first 10 elements, I'd
# do what I show next.
for element in elements:
    print(element.text)
# A better way to get the text of the first 10 elements. This results
# in 1 round-trip between this script and the browser. The loop above
# would take 10 round-trips.
print(driver.execute_script("""
return Array.prototype.slice.call(document.querySelectorAll("p.test"), 0, 10)
    .map(function (x) { return x.textContent; });
"""))
driver.quit()
The Array.prototype.slice.call rigmarole is needed because what document.querySelectorAll returns looks like an Array but is not actually an Array object. (It is a NodeList.) So it does not have a .slice method but you can pass it to Array's slice method.
Here is a significantly different approach presented as a different answer because some people will prefer this one to the other one I gave, or the other one to this one.
This one relies on using XPath to slice the results:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("http://www.example.com")
# We add 35000 paragraphs with class `test` to the page so that we can
# later show how to get the first 10 paragraphs of this class. Each
# paragraph is uniquely numbered. These paragraphs are put into
# individual `div` to make sure they are not siblings of one
# another. (This prevents offering a naive XPath expression that would
# work only if they *are* siblings.)
driver.execute_script("""
var html = [];
for (var i = 0; i < 35000; ++i) {
    html.push("<div><p class='test'>"+ i + "</p></div>");
}
document.body.innerHTML += html.join("");
""")
elements = driver.find_elements(
    By.XPATH, "(//p[@class='test'])[position() < 11]")
for element in elements:
    print(element.text)
driver.quit()
Note that XPath uses 1-based indexes so < 11 is indeed the proper expression. The parentheses around the first part of the expression are absolutely necessary. With these parentheses, the [position() < 11] test checks the position each node has in the nodeset which is the result of the expression in parentheses. Without them, the position test would check the position of the nodes relative to their parent nodes, which would match all nodes because all <p> are at the first position in their respective <div>. (This is why I've added those <div> elements above: to show this problem.)
I would use this solution if I were already using XPath for my selection. Otherwise, if I were doing a search by CSS selector or by id I would not convert it to XPath only to perform the slice. I'd use the other method I've shown.

XPath :: running counter two levels

Using the count(preceding-sibling::*) XPath expression one can obtain incrementing counters. However, can the same also be accomplished in a two-levels deep sequence?
example XML instance
<grandfather>
<father>
<child>a</child>
</father>
<father>
<child>b</child>
<child>c</child>
</father>
</grandfather>
code (with Saxon HE 9.4 jar on the CLASSPATH for XPath 2.0 features)
Trying to get a counter sequence of 1, 2 and 3 for the three child nodes with different kinds of XPath expressions:
XPathExpression expr = xpath.compile("/grandfather/father/child");
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0 ; i < nodes.getLength() ; i++) {
    Node node = nodes.item(i);
    System.out.printf("child's index is: %s %s %s, name is: %s\n"
        ,xpath.compile("count(preceding-sibling::*)").evaluate(node)
        ,xpath.compile("count(preceding-sibling::child)").evaluate(node)
        ,xpath.compile("//child/position()").evaluate(doc)
        ,xpath.compile(".").evaluate(node));
}
The above code prints:
child's index is: 0 0 1, name is: a
child's index is: 0 0 1, name is: b
child's index is: 1 1 1, name is: c
None of the three XPaths I tried managed to produce the correct sequence: 1,2,3. Clearly it can trivially be done using the i loop variable but I want to accomplish it with XPath if possible. Also I need to keep the basic framework of evaluating an XPath expression to get all the nodes to visit and then iterating on that set since that's the way the real application I work on is structured. Basically I visit each node and then need to evaluate a number of XPath expressions on it (node) or on the document (doc); one of these XPath expressions is supposed to produce this incrementing sequence.
Use the preceding axis with a name test instead.
count(preceding::child)
Using XPath 2.0, there is a much better way to do this. Fetch all <child/> nodes and use the position() function to get the index:
//child/concat("child's index is: ", position(), ", name is: ", text())
You don't say efficiency is important, but I really hate to see this done with O(n^2) code! Jens' solution shows how to do that if you can use the result in the form of a sequence of (position, name) pairs. You could also return an alternating sequence of strings and numbers using //child/(string(.), position()): though you would then want to use the s9api API rather than JAXP, because JAXP can only really handle the data types that arise in XPath 1.0.
If you need to compute the index of each node as part of other processing, it might still be worth computing the index for every node in a single initial pass, and then looking it up in a table. But if you're doing that, the simplest way is surely to iterate over the result of //child and build a map from nodes to the sequence number in the iteration.
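A minimal sketch of that last suggestion, the single pass that builds a map from nodes to sequence numbers, might look like the following. It is written in Python with the standard xml.etree.ElementTree purely for illustration (the thread itself uses Java with Saxon and JAXP), and the name index_of is made up.

```python
import xml.etree.ElementTree as ET

DOC = ("<grandfather>"
       "<father><child>a</child></father>"
       "<father><child>b</child><child>c</child></father>"
       "</grandfather>")
doc = ET.fromstring(DOC)

# One pass over the child nodes in document order: node -> 1-based index.
# Elements hash by identity, so they can be used as dictionary keys.
index_of = {child: i
            for i, child in enumerate(doc.findall("father/child"), start=1)}

# Later processing looks the index up instead of recounting preceding nodes.
for child in doc.iter("child"):
    print("child's index is: %d, name is: %s" % (index_of[child], child.text))
```

Building the map is linear in the number of nodes, whereas evaluating count(preceding::child) for every node recounts the earlier nodes each time.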

How to select all nodes such that their group size is higher than a given value, in XPath

I would like to select all <mynode> elements that have a value that appears a certain number of times (say, x) in all the elements.
Example:
<root>
<mynode>
<attr1>value_1</attr1>
<attr2>value_2</attr2>
</mynode>
<mynode>
<attr1>value_3</attr1>
<attr2>value_3</attr2>
</mynode>
<mynode>
<attr1>value_4</attr1>
<attr2>value_5</attr2>
</mynode>
<mynode>
<attr1>value_6</attr1>
<attr2>value_5</attr2>
</mynode>
</root>
In this case, I want all the <mynode> elements whose attr2 value occurs more than once (x = 1), i.e. the last two <mynode>s.
Which query do I have to perform to achieve this?
If you're using XPath 2.0 or greater, then the following will work:
for $value in distinct-values(/root/mynode/attr2)
return
if (count(/root/mynode[attr2 = $value]) > 1) then
/root/mynode[attr2 = $value]
else ()
For a more detailed discussion see: XPath/XSLT nested predicates: how to get the context of outer predicate?
This is also possible in plain XPath 1.0 (it also works in newer versions of XPath), and is probably easier to read. Think of the problem as looking for all <mynode/>s which have an <attr2/> node whose value also occurs before or after it:
//mynode[attr2 = preceding::attr2 or attr2 = following::attr2]
If <attr2/> nodes can also occur inside other elements and you do not want to test for those:
//mynode[attr2 = preceding::mynode/attr2 or attr2 = following::mynode/attr2]
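If the filtering can happen in the host language instead, counting the values first is another option. This is a sketch under assumptions: Python with the standard xml.etree.ElementTree and collections.Counter, a hypothetical helper name nodes_with_repeated, and a min_count parameter standing in for "more than x times". Like the second expression above, it only looks at attr2 elements that are children of mynode.

```python
from collections import Counter
import xml.etree.ElementTree as ET

ROOT = """
<root>
  <mynode><attr1>value_1</attr1><attr2>value_2</attr2></mynode>
  <mynode><attr1>value_3</attr1><attr2>value_3</attr2></mynode>
  <mynode><attr1>value_4</attr1><attr2>value_5</attr2></mynode>
  <mynode><attr1>value_6</attr1><attr2>value_5</attr2></mynode>
</root>
"""

def nodes_with_repeated(root, child_tag, min_count=2):
    """Return the mynode elements whose child_tag value occurs at least min_count times."""
    nodes = root.findall("mynode")
    counts = Counter(node.findtext(child_tag) for node in nodes)
    return [node for node in nodes if counts[node.findtext(child_tag)] >= min_count]

root = ET.fromstring(ROOT)
selected = nodes_with_repeated(root, "attr2")
print([node.findtext("attr1") for node in selected])
```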
