I'm fetching data with python requests & xpath.
<div class="test">
<p>pppp</p>
aaa
<em>bbb</em>
ccc
<span>span</span>
</div>
I want to get aaabbbccc.
I tried //div/*[not(self::p) and not(self::span)]//text() to exclude the p and span element, but it only returns bbb.
What is the correct path?
If the element structure is totally predictable and only the content of text nodes varies, then you can use //div/node()[not(self::p|self::span)]/descendant-or-self::text(). Note that this returns a sequence of text nodes, not a single string. This may also return some whitespace text nodes which you may want to filter out with the predicate [normalize-space(.)].
Another possibility would be //text()[not(parent::p|parent::span)].
Related
I have to write a regular expression to match child price i.e "18.99", since there are multiple span class with same class name i.e "currency & price-value", I wanted to write regex from CHILD, again here 4-11 is dynamic data, it can change.
<p class="price">CHILD 4-11yrs<br />
<span class="currency">£</span>
<span class="price-value">18.99</span></p>
Wanted a regex which identifies from CHILD to fetch the price.
Can anyone help me with this. Thanks in advance.
Since you just want to check if the span tag you need to get the value of contains a literal substring CHILD, you may as well use an XPath_Extractor and the following XPath query:
//span[parent::p[contains(text(),'CHILD')] and #class='price-value']/text()
Details:
//span - get me a span tag...
[parent::p[contains(text(),'CHILD')] - whose parent tag is p and whose value contains CHILD substring
and - AND...
#class='price-value'] - the class attribute value is price-value...
/text() - and fetch me the value of that span.
NOTE: If the p tag starts with the CHILD, you may as well use starts-with:
//span[parent::p[starts-with(text(),'CHILD')] and #class='price-value']/text()
^^^^^^^^^^^
To extract HTML content, it is much better and easier to use CSS/ JQuery Extractor.
To extract what you need expression will be:
span.price-value
Leave attribute empty and that's it
I have a following element.
driver = Selenium::WebDriver.for :phantomjs
driver.xpath("/html/body/form/table/tbody/tr[14]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/font").text
=> "unique\ntext"
But I don't want to rely on unstable table layout, so I decided to use text() function in xpath like:
driver.xpath("//font[text()='unique\ntext']")
=> nil
But as you see, I couldn't find the element by the text() function. The original text is unique<br>text.
How can I match the <br> tag by using XPath?
There is no id or name attributes that I can use.
The text() test selects any text nodes. In this example there are two such nodes, before and after the <br>. It is not the same as the text method or the string value of the parent node.
One way of selecting what you want could be like this:
driver.xpath("//font[ . ='unique\ntext']")
You might need to add extra newlines before or after the text. Note that this relies on Ruby doing the conversion of \n into an actual newline character before passing the query to the XPath processor, so you need to be careful about getting your quotes right. This compares the string-value of the node, which for an element is the concatenation of all the descendent text nodes, which is what you want.
A better solution might be to use the normalize-space() function here (as long as the unique aspect of the text doesn’t depend on the newlines).
Try:
driver.xpath("//font[normalize-space()='unique text']")
Note that all leading and trailing whitespace in the target text has been removed, and any internal whitespace is changed to a single space character.
I'm trying to select things in a table, and currently have the following expression
//*[#id='row']/tbody/tr[contains(., 'user2')]/td[contains(., 'user2')]
however, this is obviously a problem when there are users entered such as 'user 25', because that also contains 'user 2'. Can someone help me fix what's wrong with the following expressions in which I tried to match the text values exactly? (just the row for now)
//*[#id='row']/tbody/tr[text()='user2']
I tried normalizing space too, didn't seem to work
//*[#id='row']/tbody/tr[normalize-space(text())='user2']
If it will help here is the html of the page
<table id="row" class="gradientTable">
<td>
user2
</td>
<td>User2</td>
<td>User2</td>
<td>user2#mail.com</td>
<td>2</td>
<td>Student</td></tr>
<tr class="even">
The expression
//*[#id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
matches any <tr> for which any single descendant text node has the exact content user2 (after space normalization).
Note that this won't match anything in your example html. That example seems to be broken, because there's only one <tr> there, and it has no content that we can see.
Addendum:
You asked, "how exactly does .//text()[] work"?
. selects the context node (which in the above case is a tr element).
//text() selects any text node that is a descendant (of the aforementioned tr element).
[...] gives a predicate that "filters" what the preceding expression selects. So in this case it filters all text nodes that are descendants of the context tr, keeping only those whose space-normalized text content is 'user2'.
All this, as a predicate for tr, means to filter the tr elements, keeping only those for which there is at least one descendant text node whose space-normalized text content is 'user2'.
As Michael Kay pointed out, that may or may not be exactly what you want, depending on whether you want to match a table cell that contains user2 spread across b or i elements.
Addendum 2:
Can someone help me fix what's wrong with the following expressions in
which I tried to match the text values exactly? (just the row for now)
//*[#id='row']/tbody/tr[text()='user2']
What this expression matches is tr elements that have a direct child (not grandchild) text node whose value is exactly 'user2', e.g. <tr>textNode1<td>...</td>user2</tr>. Since text in tables is usually in a td element instead of directly under a tr, the above expression typically matches nothing.
//*[#id='row']/tbody/tr[normalize-space(text())='user2']
Aside from space normalization, this expression also collapses the generality of the = comparison. In other words... The previous XPath expression asks whether the tr element has any text node child whose value is user2; but this one only asks whether the tr element's first text node child has a value user2.
Why? Because the normalize-space() function takes a single string value as its argument. So if you supply text() as the argument, and there are several text() children, you are supplying a node-set (or sequence in XPath 2.0). The node-set gets converted to a string by taking the string-value of the first node in the node-set.
To get a general comparison back, with normalization, you would use
//*[#id='row']/tbody/tr[text()[normalize-space(.)='user2']]
(The . argument is the default anyway, but I prefer making it explicit.) Again, this will only work with text nodes that are direct children of tr, so you'll probably want a descendant axis in there:
//*[#id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
If you are trying to find the table cells (td) elements that contain the exact value "user 2", then you want
//*[#id='row']/tbody/tr/td[. = 'user2']
People often misuse "contains" here because they think it has the same meaning as in the English sentence above, "a node contains a value". But that's what "=" does in XPath; the XPath contains() function tests whether the content of the node has a substring equal to "user2".
Don't use text() here. The text() expression selects individual text nodes. But your content isn't necessarily all part of the same text node, for example it might be "user<b>2</b>".
I was writing an XPath expression, and I had a strange error which I fixed, but what is the difference between the following two XPath expressions?
"//td[starts-with(normalize-space()),'Posted Date:')]"
and
"//td[starts-with(normalize-space(text()),'Posted Date:')]"
Mainly, what will the first XPath expression catch? Because I was getting a lot of strange results. So what does the text() make in the matching? Also, is there is a difference if I said normalize-space() & normalize-space(.)?
Well, the real question is: what's the difference between . and text()?
. is the current node. And if you use it where a string is expected (i.e. as the parameter of normalize-space()), the engine automatically converts the node to the string value of the node, which for an element is all the text nodes within the element concatenated. (Because I'm guessing the question is really about elements.)
text() on the other hand only selects text nodes that are the direct children of the current node.
So for example given the XML:
<a>Foo
<b>Bar</b>
lish
</a>
and assuming <a> is your current node, normalize-space(.) will return Foo Bar lish, but normalize-space(text()) will fail, because text() returns a nodeset of two text nodes (Foo and lish), which normalize-space() doesn't accept.
To cut a long story short, if you want to normalize all the text within an element, use .. If you want to select a specific text node, use text(), but always remember that despite its name, text() returns a nodeset, which is only converted to a string automatically if it has a single element.
This is the data:
<p>
<span class="z">XXX</span><br/>
123456<br/>
78910</p>
There are also whitespaces and newlines all over the place.
I need to get only '<br/>123456<br/>78910' skipping the span element.
When I run this xpath: '//p/text()' I get 3 matches: The first - a bunch of spaces and newlines, the second one with 123456 and the third one with 78910.
Is there any other way to skip the span element? Is it possible to somehow join the matches?
It looks like you are trying to select every node after the span element:
/p/span/following-sibling::node()
If you want the text node children without the white space only text nodes:
/p/text()[normalize-space()]
Use:
/p/node()[not(self::span) and (not(self::text[not(normalize-space())]))]
This selects all nodes that a children of the top element p and that if they are text nodes they are not white-space only.