XPath problem with skipping the element / joining matches - xpath

This is the data:
<p>
<span class="z">XXX</span><br/>
123456<br/>
78910</p>
There are also whitespaces and newlines all over the place.
I need to get only '<br/>123456<br/>78910' skipping the span element.
When I run this xpath: '//p/text()' I get 3 matches: The first - a bunch of spaces and newlines, the second one with 123456 and the third one with 78910.
Is there any other way to skip the span element? Is it possible to somehow join the matches?

It looks like you are trying to select every node after the span element:
/p/span/following-sibling::node()
If you want the text node children, excluding whitespace-only text nodes:
/p/text()[normalize-space()]

Use:
/p/node()[not(self::span) and not(self::text()[not(normalize-space())])]
This selects all nodes that are children of the top element p, except the span element and any text nodes that are whitespace-only.
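To answer the "join the matches" part of the question, here is a minimal sketch in Python. It assumes the third-party lxml package is available and writes the text-node test as self::text(). The joining step happens in the host language, since XPath 1.0 has no string-join function.

```python
from lxml import etree

# Sample markup from the question, including the stray whitespace and newlines.
doc = etree.fromstring(
    '<p>\n  <span class="z">XXX</span><br/>\n  123456<br/>\n  78910</p>'
)

# Keep every child of p except the span and whitespace-only text nodes.
nodes = doc.xpath(
    "/p/node()[not(self::span) and not(self::text()[not(normalize-space())])]"
)

# Text nodes come back as strings and are trimmed; elements are serialized
# without their tail text so the following text is not emitted twice.
parts = [
    n.strip() if isinstance(n, str)
    else etree.tostring(n, encoding="unicode", with_tail=False)
    for n in nodes
]
joined = "".join(parts)
print(joined)  # <br/>123456<br/>78910
```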

Related

xpath how to extract the element itself and one of its child?

I'm fetching data with python requests & xpath.
<div class="test">
<p>pppp</p>
aaa
<em>bbb</em>
ccc
<span>span</span>
</div>
I want to get aaabbbccc.
I tried //div/*[not(self::p) and not(self::span)]//text() to exclude the p and span element, but it only returns bbb.
What is the correct path?
If the element structure is totally predictable and only the content of text nodes varies, then you can use //div/node()[not(self::p|self::span)]/descendant-or-self::text(). Note that this returns a sequence of text nodes, not a single string. This may also return some whitespace text nodes which you may want to filter out with the predicate [normalize-space(.)].
Another possibility would be //text()[not(parent::p|parent::span)].
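Both suggestions can be tried against the sample markup. This is a sketch using lxml's etree rather than the asker's actual requests pipeline, which is not shown in the question. The final join is done in Python because both expressions return a list of text nodes.

```python
from lxml import etree

# The div from the question, parsed as a standalone fragment.
div = etree.fromstring(
    '<div class="test">\n<p>pppp</p>\naaa\n<em>bbb</em>\nccc\n<span>span</span>\n</div>'
)

# Variant 1: skip the p and span children, then collect the remaining text nodes.
by_child = div.xpath(
    "//div/node()[not(self::p|self::span)]"
    "/descendant-or-self::text()[normalize-space(.)]"
)

# Variant 2: take every text node whose parent is neither p nor span.
by_parent = div.xpath("//text()[not(parent::p|parent::span)][normalize-space(.)]")

# Trim each text node and concatenate them in document order.
joined = "".join(t.strip() for t in by_child)
print(joined)  # aaabbbccc
```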

How to select a node that is inside a sibling of a parent node using xpath expression?

I'm trying to select a node based on the known text inside a sibling of a parent node. To be clearer my HTML has the following structure:
<k>
<l>Known</l>
</k>
<k>
<l>Desired</l>
</k>
My attempt:
//k//following-sibling::*[text()="Known"]
Returns:
Known
Why?
It's because basically you're selecting any descendant of k with the text Known.
(You're actually matching the l because it's a sibling of the whitespace before it. If you remove the whitespace (including line breaks), your xpath probably won't return anything.)
Try selecting the first following sibling k...
//k[l='Known']/following-sibling::k[1]/l
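A quick way to compare the two expressions is to run them side by side. The sketch below uses lxml and wraps the two k elements in a hypothetical root element, since the question does not show their parent.

```python
from lxml import etree

# "root" is an assumed wrapper; the line breaks between elements are kept on purpose.
root = etree.fromstring(
    "<root>\n<k>\n<l>Known</l>\n</k>\n<k>\n<l>Desired</l>\n</k>\n</root>"
)

# Original attempt: the l element is reached as a following sibling of the
# whitespace text node inside the first k, so the "Known" node itself comes back.
attempt = root.xpath('//k//following-sibling::*[text()="Known"]')

# Suggested fix: find the k whose l child is "Known", step to the next k, take its l.
fixed = root.xpath("//k[l='Known']/following-sibling::k[1]/l")

print([e.text for e in attempt])  # ['Known']
print([e.text for e in fixed])    # ['Desired']
```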

Using XPath how do I select a node() at a specific position that is also text()?

I want to select the previous node only if it is a text node (and contains only whitespace). I have an XPath like so: path/preceding-sibling::node()[1][normalize-space()='']. This works great but matches both text and element nodes (if the nodes contain only whitespace). Using path/preceding-sibling::text()[1][normalize-space()=''] will select the first preceding node that is a text node which is definitely not what I want if there are any elements in between.
How can I combine the two tests?
You can use self::text() to test whether the current node is a text node, like so:
path/preceding-sibling::node()[1][self::text() and normalize-space()='']
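The difference between the two predicates shows up on a small made-up document. The element names r, e, x, y and z below are hypothetical, and lxml is assumed as the XPath engine.

```python
from lxml import etree

# x is preceded by an element that contains only whitespace;
# z is preceded by a whitespace-only text node.
root = etree.fromstring("<r><e> </e><x/><y>t</y> <z/></r>")
x = root.find("x")
z = root.find("z")

loose = "preceding-sibling::node()[1][normalize-space()='']"
strict = "preceding-sibling::node()[1][self::text() and normalize-space()='']"

x_loose = x.xpath(loose)    # matches the <e> element, which is not wanted
x_strict = x.xpath(strict)  # empty: the nearest preceding node is an element
z_strict = z.xpath(strict)  # the whitespace text node right before <z/>

print(len(x_loose), len(x_strict), len(z_strict))  # 1 0 1
```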

How to match br tag in XPath text() function

I have the following element.
driver = Selenium::WebDriver.for :phantomjs
driver.xpath("/html/body/form/table/tbody/tr[14]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/font").text
=> "unique\ntext"
But I don't want to rely on unstable table layout, so I decided to use text() function in xpath like:
driver.xpath("//font[text()='unique\ntext']")
=> nil
But as you see, I couldn't find the element by the text() function. The original text is unique<br>text.
How can I match the <br> tag by using XPath?
There are no id or name attributes that I can use.
The text() test selects any text nodes. In this example there are two such nodes, before and after the <br>. It is not the same as the text method or the string value of the parent node.
One way of selecting what you want could be like this:
driver.xpath("//font[ . ='unique\ntext']")
You might need to add extra newlines before or after the text. Note that this relies on Ruby converting \n into an actual newline character before passing the query to the XPath processor, so you need to be careful about getting your quotes right. This compares the string-value of the node, which for an element is the concatenation of all its descendant text nodes, which is what you want.
A better solution might be to use the normalize-space() function here (as long as the unique aspect of the text doesn’t depend on the newlines).
Try:
driver.xpath("//font[normalize-space()='unique text']")
Note that all leading and trailing whitespace in the target text has been removed, and any internal whitespace is changed to a single space character.
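The three comparisons can be checked outside the browser. The sketch below uses Python with lxml instead of the Ruby driver from the question, and it assumes the markup has a line break after the br. That assumption matters: a br element contributes no characters to the string-value, so without whitespace in the source the value would be "uniquetext".

```python
from lxml import etree

# Assumed markup: a newline follows the <br/> in the source.
font = etree.fromstring("<font>unique<br/>\ntext</font>")

# text() compares each text node separately ("unique" and "\ntext"), so no match.
by_text_node = font.xpath("//font[text()='unique\ntext']")

# "." compares the string-value, the concatenation of all descendant text nodes.
by_string_value = font.xpath("//font[. = 'unique\ntext']")

# normalize-space() collapses the newline into a single space first.
by_normalized = font.xpath("//font[normalize-space()='unique text']")

print(len(by_text_node), len(by_string_value), len(by_normalized))  # 0 1 1

# Without any whitespace around the <br/>, the two words run together.
tight = etree.fromstring("<font>unique<br/>text</font>")
print(tight.xpath("string(.)"))  # uniquetext
```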

Contains(.,'text') function for matching text

I'm trying to select things in a table, and currently have the following expression
//*[@id='row']/tbody/tr[contains(., 'user2')]/td[contains(., 'user2')]
however, this is obviously a problem when there are users entered such as 'user25', because that also contains 'user2'. Can someone help me fix what's wrong with the following expressions in which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
I tried normalizing space too, didn't seem to work
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
If it will help here is the html of the page
<table id="row" class="gradientTable">
<td>
user2
</td>
<td>User2</td>
<td>User2</td>
<td>user2@mail.com</td>
<td>2</td>
<td>Student</td></tr>
<tr class="even">
The expression
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
matches any <tr> for which any single descendant text node has the exact content user2 (after space normalization).
Note that this won't match anything in your example html. That example seems to be broken, because there's only one <tr> there, and it has no content that we can see.
Addendum:
You asked, "how exactly does .//text()[] work"?
. selects the context node (which in the above case is a tr element).
//text() selects any text node that is a descendant (of the aforementioned tr element).
[...] gives a predicate that "filters" what the preceding expression selects. So in this case it filters all text nodes that are descendants of the context tr, keeping only those whose space-normalized text content is 'user2'.
All this, as a predicate for tr, means to filter the tr elements, keeping only those for which there is at least one descendant text node whose space-normalized text content is 'user2'.
As Michael Kay pointed out, that may or may not be exactly what you want, depending on whether you want to match a table cell that contains user2 spread across b or i elements.
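To see the filtering in action, here is a sketch with lxml on a small well-formed table. The table is made up for illustration (the HTML in the question is incomplete), and the 'user25' row stands in for the false positive described there.

```python
from lxml import etree

# Hypothetical two-row table; only the id and the cell layout follow the question.
table = etree.fromstring(
    '<table id="row"><tbody>'
    '<tr class="odd"><td>\n user2\n </td><td>User2</td></tr>'
    '<tr class="even"><td>user25</td><td>User25</td></tr>'
    '</tbody></table>'
)

# Substring test: "user25" also contains "user2", so both rows match.
substring = table.xpath("//*[@id='row']/tbody/tr[contains(., 'user2')]")

# Exact test on each descendant text node after space normalization.
exact = table.xpath(
    "//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]"
)

print([tr.get("class") for tr in substring])  # ['odd', 'even']
print([tr.get("class") for tr in exact])      # ['odd']
```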
Addendum 2:
Can someone help me fix what's wrong with the following expressions in
which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
What this expression matches is tr elements that have a direct child (not grandchild) text node whose value is exactly 'user2', e.g. <tr>textNode1<td>...</td>user2</tr>. Since text in tables is usually in a td element instead of directly under a tr, the above expression typically matches nothing.
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
Aside from space normalization, this expression also collapses the generality of the = comparison. In other words... The previous XPath expression asks whether the tr element has any text node child whose value is user2; but this one only asks whether the tr element's first text node child has a value user2.
Why? Because the normalize-space() function takes a single string value as its argument. So if you supply text() as the argument, and there are several text() children, you are supplying a node-set (or sequence in XPath 2.0). The node-set gets converted to a string by taking the string-value of the first node in the node-set.
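The first-node-only behaviour is easy to reproduce under XPath 1.0 rules, which is what lxml implements. The tr below is a contrived example with text directly under the row, purely to make the effect visible.

```python
from lxml import etree

# Contrived row: its text children are " " and " user2 ", in that order.
tr = etree.fromstring("<tr> <td>x</td> user2 </tr>")

# Plain comparison: true if ANY text child equals 'user2'; none does, due to padding.
raw = tr.xpath("/tr[text()='user2']")

# normalize-space(text()) uses only the FIRST text child, which is just whitespace.
first_only = tr.xpath("/tr[normalize-space(text())='user2']")

# Normalizing inside a predicate on text() tests every text child individually.
any_child = tr.xpath("/tr[text()[normalize-space(.)='user2']]")

print(len(raw), len(first_only), len(any_child))  # 0 0 1
```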
To get a general comparison back, with normalization, you would use
//*[@id='row']/tbody/tr[text()[normalize-space(.)='user2']]
(The . argument is the default anyway, but I prefer making it explicit.) Again, this will only work with text nodes that are direct children of tr, so you'll probably want a descendant axis in there:
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
If you are trying to find the table cell (td) elements that contain the exact value "user2", then you want
//*[@id='row']/tbody/tr/td[. = 'user2']
People often misuse "contains" here because they think it has the same meaning as in the English sentence above, "a node contains a value". But that's what "=" does in XPath; the XPath contains() function tests whether the content of the node has a substring equal to "user2".
Don't use text() here. The text() expression selects individual text nodes. But your content isn't necessarily all part of the same text node, for example it might be "user<b>2</b>".
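A short sketch with lxml illustrates the point about split text nodes. The row is hypothetical: one cell spreads "user2" across a b element, the other holds "user25".

```python
from lxml import etree

# Hypothetical row: the first cell's content is split across two text nodes.
row = etree.fromstring("<tr><td>user<b>2</b></td><td>user25</td></tr>")

# "=" on the cell compares its whole string-value, so the split cell matches.
by_value = row.xpath("/tr/td[. = 'user2']")

# text() looks at individual text nodes ("user" and "user25"), so nothing matches.
by_text = row.xpath("/tr/td[text() = 'user2']")

# contains() is a substring test, so the "user25" cell matches as well.
by_contains = row.xpath("/tr/td[contains(., 'user2')]")

print(len(by_value), len(by_text), len(by_contains))  # 1 0 2
```

If the cells may carry leading or trailing whitespace, comparing normalize-space(.) instead of . is one way to keep the match exact.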
