How to exclude nodes surrounding a text node? - xpath

I feel like I'm missing something basic but I can't figure it out; given this xml:
<p>
<tag>good text</tag>
<tag>this may be good </tag>
bad text
<tag>some other bad text</tag>
<tag>last good text</tag>
</p>
I would like to select everything EXCEPT the text node (bad text) and the immediately following tag node. Obviously, the number of good tags and standalone text nodes varies, so I can't rely on their absolute positions.
I know that
p/text()
selects bad text and
//p/*
selects all p children while excluding bad text. But I can't figure out how to end up with only the first, second and fourth tags, in this example.
Desired output:
<p>
<tag>good text</tag>
<tag>this may be good</tag>
<tag>last good text</tag>
</p>

This XPath 1.0 expression:
/p/*[not(preceding-sibling::node()[1][normalize-space(self::text())='bad text'])]
selects:
<tag>good text</tag>
<tag>this may be good </tag>
<tag>last good text</tag>
Meaning:
Select child elements of p not having as first preceding node a text node with "bad text" string as space normalized string value.
Check: http://www.xpathtester.com/xpath/96aa0415f3512b0a84ad1e2330e0278f
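If you want to check the logic of that predicate without an XPath engine, here is a minimal Python sketch of the same rule, assuming the standard library's xml.etree.ElementTree (which has no preceding-sibling axis, so the "first preceding node" is read from p.text and each child's tail). The helper names normalize_space and children_not_after are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

XML = """<p>
<tag>good text</tag>
<tag>this may be good </tag>
bad text
<tag>some other bad text</tag>
<tag>last good text</tag>
</p>"""


def normalize_space(s):
    # Emulates XPath normalize-space(): only space, tab, CR and LF are whitespace.
    return " ".join(w for w in re.split(r"[ \t\r\n]+", s or "") if w)


def children_not_after(p, marker):
    # Keep child elements whose immediately preceding text node does not
    # normalize to `marker` (hypothetical helper, not a library function).
    kept = []
    preceding_text = p.text            # text before the first child, if any
    for child in p:
        if normalize_space(preceding_text) != marker:
            kept.append(child)
        preceding_text = child.tail    # text right after this child, if any
    return kept


kept = [child.text for child in children_not_after(ET.fromstring(XML), "bad text")]
print(kept)
```

Only the tag that directly follows the "bad text" node is dropped; the bare text node itself is never selected because only elements are collected.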

Related

How to eliminate line breaks and spaces around text in xPath

There is a page like the following:
<html>
<head></head>
<body>
<p> 5-8 </p>
<p></br>5-8</br></p>
<p> </br>5-8&nbsp</br></p>
</body>
</html>
The goal is to extract the text in each p; the line breaks and whitespace are not wanted.
How to achieve that?
Thanks in advance! Best Wishes!
-- First update
Another post suggested using normalize-space(). I tried that, and it does remove the spaces. However, only one node is left. How can I get the text of all 30 nodes without the unwanted spaces? Thanks in advance and best wishes!
It's not possible to achieve what you want entirely in XPath 1.0, but in XPath 2.0 or later it is possible.
You don't say what XPath interpreter you have available but you mention Chrome's XPath Helper which relies on Chrome's built in XPath interpreter which supports XPath 1.0 (as is the norm for web browsers).
But it's possible you are just using Chrome to examine the data, and have another, more modern XPath interpreter such as Saxon. If so, an XPath 2.0 solution will work for you, though you won't be able to use it in Chrome, obviously.
I've tidied up your XML example:
<html>
<head></head>
<body>
<p>  5-8  </p>
<p><br/>5-8<br/></p>
<p> <br/>5-8 <br/></p>
</body>
</html>
NB those are non-breaking spaces there.
In XPath 2.0:
for $paragraph in //p
return normalize-space(
translate($paragraph, codepoints-to-string(160), ' ')
)
NB this uses the translate function to convert non-breaking spaces (the char with Unicode codepoint 160) to a space, and then uses normalize-space to trim leading and trailing whitespace (I'm not sure what you would want to do if there were whitespace in the middle of the para, instead of just at the start or end; this will convert any such sequence of whitespace to a single space character). You might think normalize-space would be enough, but in fact a non-breaking-space doesn't fall into normalize-space's category of "white space" so it would not be trimmed.
In XPath 1.0 it is not exactly possible to do what you want. You could use an XPath expression that would return each p element to your host language, and then iterate over those p elements, executing a second XPath expression for each one, with that p as the context. Essentially this means moving the for ... in ... return iterator from XPath into your host language. To select the paragraphs:
//p
... and then for each one:
normalize-space(
translate(., ' ', ' ')
)
NB in that expression, the first string literal is a non-breaking-space character, and the second is a space. XPath 1.0 doesn't have the codepoints-to-string function or I'd have used that, for clarity.
The . which is the first parameter to the translate function represents the context node (the current node). When you execute this XPath expression in your host language you need to pass one of the p elements as the context node. You don't say what host language you're using, but in JavaScript, for instance, you could use the document.evaluate function to execute the first XPath, receiving an iterator of p elements. Then for each element, you'd call document.evaluate again with the second XPath, passing that p element as the context node argument, which ensures that the p element is the context node for the XPath (i.e. the . in the expression).
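As an illustration of moving the loop into the host language, here is a rough Python sketch. It assumes the tidied markup above is well-formed XML and uses only the standard library, so normalize-space and translate are emulated by hand rather than evaluated by an XPath engine; normalize_space and clean are hypothetical helper names. Note that Python's own str.split() would treat the non-breaking space as whitespace, which XPath's normalize-space does not, hence the explicit character class.

```python
import re
import xml.etree.ElementTree as ET

NBSP = "\u00a0"  # non-breaking space, Unicode codepoint 160

HTML = (
    "<html><head></head><body>"
    "<p>\u00a0 5-8 \u00a0</p>"
    "<p><br/>5-8<br/></p>"
    "<p> <br/>5-8\u00a0<br/></p>"
    "</body></html>"
)


def normalize_space(s):
    # Emulates XPath normalize-space(): only space, tab, CR and LF are whitespace.
    return " ".join(w for w in re.split(r"[ \t\r\n]+", s or "") if w)


def clean(p):
    # Equivalent in spirit to normalize-space(translate(., NBSP, ' ')).
    string_value = "".join(p.itertext())
    return normalize_space(string_value.replace(NBSP, " "))


root = ET.fromstring(HTML)
paragraphs = [clean(p) for p in root.iter("p")]   # the "for each p" loop
print(paragraphs)

# Without the translate step the non-breaking spaces survive normalization.
untranslated = normalize_space("".join(next(root.iter("p")).itertext()))
print(repr(untranslated))
```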

xpath how to extract the element itself and one of its child?

I'm fetching data with python requests & xpath.
<div class="test">
<p>pppp</p>
aaa
<em>bbb</em>
ccc
<span>span</span>
</div>
I want to get aaabbbccc.
I tried //div/*[not(self::p) and not(self::span)]//text() to exclude the p and span element, but it only returns bbb.
What is the correct path?
If the element structure is totally predictable and only the content of text nodes varies, then you can use //div/node()[not(self::p|self::span)]/descendant-or-self::text(). Note that this returns a sequence of text nodes, not a single string. This may also return some whitespace text nodes which you may want to filter out with the predicate [normalize-space(.)].
Another possibility would be //text()[not(parent::p|parent::span)].
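For comparison, here is a small Python sketch of the first expression's logic using only the standard library. It assumes the fragment parses as well-formed XML (real pages fetched with requests usually need an HTML parser), and text_outside is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="test">
<p>pppp</p>
aaa
<em>bbb</em>
ccc
<span>span</span>
</div>"""


def text_outside(parent, excluded):
    # Collect the text of `parent`, skipping child elements whose tag is excluded.
    pieces = [parent.text or ""]
    for child in parent:
        if child.tag not in excluded:
            pieces.extend(child.itertext())   # text inside the kept child
        pieces.append(child.tail or "")       # bare text following the child
    # Drop whitespace-only pieces, like the [normalize-space(.)] predicate.
    return [piece.strip() for piece in pieces if piece.strip()]


div = ET.fromstring(HTML)
pieces = text_outside(div, {"p", "span"})
joined = "".join(pieces)
print(pieces, joined)
```

As with the XPath version, the result is a list of separate text pieces; joining them is a step done in the host language.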

xpath normalize-space with contains [duplicate]

This question already has answers here:
XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode
(7 answers)
Closed 4 years ago.
I have an xpath string //*[normalize-space() = "some sub text"]/text()/.. which works fine if the text I am finding is in a node which does not have multiple text sub nodes, but if it does then it won't work, so I am trying to combine it with contains() as follows: //*[contains(normalize-space(), "some sub text")]/text()/.. which does work, but it always returns the body and html tags as well as the p tag which contains the text. How can I change it so it only returns the p tag?
It depends exactly what you want to match.
The most likely scenario is that you want to match some text if it appears anywhere in the normalized string value of the element, possibly split across multiple text nodes at different levels: for example any of the following:
<p>some text</p>
<p>There was some text</p>
<p>There was <b>some text</b></p>
<p>There <b>was</b> some text</p>
<p>There was <b>some</b> <!--italic--> <i>text</i></p>
<p>There was <b>some</b> text</p>
If that's the case, then use //p[contains(normalize-space(.), "some text")].
As you point out, using //* with this predicate will also match ancestor elements of the relevant element. The simplest way to fix this is by using //p to say what element you are looking for. If you don't know what element you are looking for, then in XPath 3.0 you could use
innermost(//*[contains(normalize-space(.), "some text")])
but if you have the misfortune not to be using XPath 3.0, then you could do (//*[contains(normalize-space(.), "some text")])[last()], though this doesn't do quite the same thing if there are multiple paragraphs with the required content.
If you don't want to match all of the above, but want to be more selective, then you need to explain your requirements more clearly.
Either way, use of text() in a path expression is generally a code smell, except in the rare cases where you want to select text in an element only if it is not wrapped in other tags.
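To see why //* also returns html and body, and what keeping only the innermost matches does, here is a rough Python emulation using the standard library. It is a sketch of the idea, not the XPath 3.0 innermost() function itself; the sample document and the helper names matches and innermost_matches are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<html><body>
<p>There was <b>some</b> <i>text</i></p>
<p>There was <b>some text</b></p>
<p>nothing relevant</p>
</body></html>"""


def normalize_space(s):
    # Emulates XPath normalize-space(): only space, tab, CR and LF are whitespace.
    return " ".join(w for w in re.split(r"[ \t\r\n]+", s or "") if w)


def matches(el, needle):
    # Emulates the predicate [contains(normalize-space(.), needle)].
    return needle in normalize_space("".join(el.itertext()))


def innermost_matches(root, needle):
    # Keep only matching elements that have no matching descendant.
    hits = [el for el in root.iter() if matches(el, needle)]
    hit_ids = {id(el) for el in hits}
    return [el for el in hits
            if not any(id(d) in hit_ids for d in el.iter() if d is not el)]


root = ET.fromstring(DOC)
all_tags = [el.tag for el in root.iter() if matches(el, "some text")]
inner_tags = [el.tag for el in innermost_matches(root, "some text")]
print(all_tags)    # every ancestor of a match matches too
print(inner_tags)  # only the deepest matching elements remain
```

In the second paragraph the innermost match is the b element rather than the p, which is one reason naming the element (//p) is usually the simpler fix when you know it.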

Contains(.,'text') function for matching text

I'm trying to select things in a table, and currently have the following expression
//*[@id='row']/tbody/tr[contains(., 'user2')]/td[contains(., 'user2')]
however, this is obviously a problem when there are users entered such as 'user25', because that also contains 'user2'. Can someone help me fix what's wrong with the following expressions in which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
I tried normalizing space too, didn't seem to work
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
If it will help here is the html of the page
<table id="row" class="gradientTable">
<td>
user2
</td>
<td>User2</td>
<td>User2</td>
<td>user2@mail.com</td>
<td>2</td>
<td>Student</td></tr>
<tr class="even">
The expression
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
matches any <tr> for which any single descendant text node has the exact content user2 (after space normalization).
Note that this won't match anything in your example html. That example seems to be broken, because there's only one <tr> there, and it has no content that we can see.
Addendum:
You asked, "how exactly does .//text()[] work"?
. selects the context node (which in the above case is a tr element).
//text() selects any text node that is a descendant (of the aforementioned tr element).
[...] gives a predicate that "filters" what the preceding expression selects. So in this case it filters all text nodes that are descendants of the context tr, keeping only those whose space-normalized text content is 'user2'.
All this, as a predicate for tr, means to filter the tr elements, keeping only those for which there is at least one descendant text node whose space-normalized text content is 'user2'.
As Michael Kay pointed out, that may or may not be exactly what you want, depending on whether you want to match a table cell that contains user2 spread across b or i elements.
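To make the difference between the substring test and the per-text-node test concrete, here is a small Python sketch with the standard library. The two-row table is an invented, well-formed stand-in for the broken sample, and has_text_node is a hypothetical helper that mimics .//text()[normalize-space(.)='user2'].

```python
import re
import xml.etree.ElementTree as ET

TABLE = """<table id="row"><tbody>
<tr><td>
user2
</td><td>Student</td></tr>
<tr><td>user25</td><td>Student</td></tr>
</tbody></table>"""


def normalize_space(s):
    # Emulates XPath normalize-space(): only space, tab, CR and LF are whitespace.
    return " ".join(w for w in re.split(r"[ \t\r\n]+", s or "") if w)


def has_text_node(tr, value):
    # itertext() yields each descendant text node separately, in document order.
    return any(normalize_space(t) == value for t in tr.itertext())


rows = list(ET.fromstring(TABLE).iter("tr"))

# contains(., 'user2'): substring test on the whole string value of the row.
substring_rows = [i for i, tr in enumerate(rows) if "user2" in "".join(tr.itertext())]

# .//text()[normalize-space(.)='user2']: some single text node equals user2.
exact_rows = [i for i, tr in enumerate(rows) if has_text_node(tr, "user2")]

print(substring_rows, exact_rows)
```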
Addendum 2:
Can someone help me fix what's wrong with the following expressions in
which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
What this expression matches is tr elements that have a direct child (not grandchild) text node whose value is exactly 'user2', e.g. <tr>textNode1<td>...</td>user2</tr>. Since text in tables is usually in a td element instead of directly under a tr, the above expression typically matches nothing.
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
Aside from space normalization, this expression also collapses the generality of the = comparison. In other words... The previous XPath expression asks whether the tr element has any text node child whose value is user2; but this one only asks whether the tr element's first text node child has a value user2.
Why? Because the normalize-space() function takes a single string value as its argument. So if you supply text() as the argument, and there are several text() children, you are supplying a node-set (or sequence in XPath 2.0). The node-set gets converted to a string by taking the string-value of the first node in the node-set.
To get a general comparison back, with normalization, you would use
//*[@id='row']/tbody/tr[text()[normalize-space(.)='user2']]
(The . argument is the default anyway, but I prefer making it explicit.) Again, this will only work with text nodes that are direct children of tr, so you'll probably want a descendant axis in there:
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
If you are trying to find the table cell (td) elements that contain the exact value "user2", then you want
//*[@id='row']/tbody/tr/td[. = 'user2']
People often misuse "contains" here because they think it has the same meaning as in the English sentence above, "a node contains a value". But that's what "=" does in XPath; the XPath contains() function tests whether the content of the node has a substring equal to "user2".
Don't use text() here. The text() expression selects individual text nodes. But your content isn't necessarily all part of the same text node, for example it might be "user<b>2</b>".
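The three tests can be compared side by side in a short Python sketch (standard library only; the row below is invented for illustration, and child_text_nodes is a hypothetical helper standing in for the text() step).

```python
import xml.etree.ElementTree as ET

ROW = "<tr><td>user<b>2</b></td><td>user25</td><td>user2</td></tr>"
tds = list(ET.fromstring(ROW))


def string_value(el):
    # The XPath string value: all descendant text concatenated.
    return "".join(el.itertext())


def child_text_nodes(el):
    # Text nodes that are direct children of el, i.e. what text() selects.
    return [t for t in [el.text] + [child.tail for child in el] if t]


# td[contains(., 'user2')]: substring of the string value
by_contains = [i for i, td in enumerate(tds) if "user2" in string_value(td)]

# td[. = 'user2']: the whole string value equals user2
by_equals = [i for i, td in enumerate(tds) if string_value(td) == "user2"]

# td[text() = 'user2']: some single child text node equals user2
by_text_node = [i for i, td in enumerate(tds) if "user2" in child_text_nodes(td)]

print(by_contains, by_equals, by_text_node)
```

The first cell, where the value is split as user<b>2</b>, is found by the = comparison on the string value but missed by the text() version.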

What is the difference between normalize-space(.) and normalize-space(text())?

I was writing an XPath expression, and I had a strange error which I fixed, but what is the difference between the following two XPath expressions?
"//td[starts-with(normalize-space(),'Posted Date:')]"
and
"//td[starts-with(normalize-space(text()),'Posted Date:')]"
Mainly, what will the first XPath expression catch? Because I was getting a lot of strange results. So what difference does text() make in the matching? Also, is there a difference between normalize-space() and normalize-space(.)?
Well, the real question is: what's the difference between . and text()?
. is the current node. And if you use it where a string is expected (i.e. as the parameter of normalize-space()), the engine automatically converts the node to the string value of the node, which for an element is all the text nodes within the element concatenated. (Because I'm guessing the question is really about elements.)
text() on the other hand only selects text nodes that are the direct children of the current node.
So for example given the XML:
<a>Foo
<b>Bar</b>
lish
</a>
and assuming <a> is your current node, normalize-space(.) will return Foo Bar lish. normalize-space(text()) behaves differently, because text() returns a node-set of two text nodes (Foo and lish): in XPath 1.0 only the first node is converted to a string, so the result is Foo, while in XPath 2.0 and later passing more than one node is a type error.
To cut a long story short, if you want to normalize all the text within an element, use the context node (.). If you want to select a specific text node, use text(), but always remember that despite its name, text() returns a node-set; XPath 1.0 converts it to a string by using only its first node, and XPath 2.0 rejects it unless it holds at most one node.
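Here is a minimal Python sketch of the two cases for the example above, using only the standard library. It emulates XPath 1.0 conversion rules (a node-set passed where a string is expected contributes only its first node); normalize_space is a hand-written stand-in for the XPath function.

```python
import re
import xml.etree.ElementTree as ET

XML = "<a>Foo\n<b>Bar</b>\nlish\n</a>"


def normalize_space(s):
    # Emulates XPath normalize-space(): only space, tab, CR and LF are whitespace.
    return " ".join(w for w in re.split(r"[ \t\r\n]+", s or "") if w)


a = ET.fromstring(XML)

# normalize-space(.): the string value is every descendant text node joined.
whole = normalize_space("".join(a.itertext()))

# text(): only the text nodes that are direct children of <a>.
text_children = [t for t in [a.text] + [child.tail for child in a] if t]

# XPath 1.0 normalize-space(text()): only the first node of the node-set is used.
first_only = normalize_space(text_children[0])

print(whole)
print(len(text_children), first_only)
```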

Resources