Perhaps my problem (and that of many others) is only a misconception about what I see in XML code as a tag: is it what XPath detects with node()? Or are text and attributes also detected?
When can ., *, @*, text() and node() be used for "tag detection"?
PS: my guess is that only * and node() can detect tags (and . is like a * about "children of this tag")... But I think I am wrong.
Close; what you call a tag is an element in XML parlance, and an element is a type of node, as are attributes, text, comments, etc.
In terms of XPath expressions, node() selects all nodes, irrespective of type, whilst * selects nodes of type element, and @* would give you the attributes.
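To make the node types concrete, here is a minimal sketch in Python using only the standard library. It does not run XPath; it filters DOM children by node type to mimic roughly what node(), *, text() and @* would pick up on the child and attribute axes. The sample markup is made up for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical sample document, not taken from the question.
doc = minidom.parseString('<p id="x">Hello <b>world</b><!-- note --></p>')
p = doc.documentElement

# Rough analogue of child::node(): every child, whatever its type.
all_children = list(p.childNodes)

# Rough analogue of child::* : element children only.
element_children = [n for n in p.childNodes if n.nodeType == Node.ELEMENT_NODE]

# Rough analogue of child::text(): text-node children only.
text_children = [n for n in p.childNodes if n.nodeType == Node.TEXT_NODE]

# Rough analogue of @* : the attributes of the element.
attribute_names = sorted(p.attributes.keys())

print(len(all_children))                      # text, element and comment
print([n.tagName for n in element_children])
print([n.data for n in text_children])
print(attribute_names)
```

Here the p element has three child nodes (a text node, the b element, and a comment), of which only one is an element; the id attribute is reached separately and is not a child.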
To automatically replace keywords with links based on a list of keyword-link pairs, I need to get text that is not already linked, is not inside a script, and has not been manually excluded, inside paragraphs (p) and list items (li), for use in Drupal's Alinks module.
I modified the existing XPath selector as follows and would like feedback on whether it is efficient or could be improved:
//*[p or li]//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
The xpath is meant to work with any html5 content, also with self closing tags (not well-formed xml) -- that's the way the module was designed, and it works quite well.
To select the text nodes that are descendants of p or li elements but not descendants of a or script elements, you can use this XPath 1.0 expression:
//*[self::p|self::li]
//text()[
not(ancestor::a|ancestor::script|ancestor::*[@data-alink-ignore])
]
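As a rough cross-check of what that expression is meant to select, the same filter can be emulated in plain Python. This is only a sketch under assumptions: the sample markup is invented and well-formed (real HTML5 would need an HTML parser), and the helper names has_ancestor and linkable_text are hypothetical, not part of the Alinks module.

```python
from xml.dom import minidom, Node

# Hypothetical, well-formed sample markup.
SAMPLE = (
    '<div>'
    '<p>Buy <a href="/x">apples</a> and pears</p>'
    '<ul><li data-alink-ignore="1">skip me</li><li>plums</li></ul>'
    '<h2>heading</h2>'
    '</div>'
)

def has_ancestor(node, predicate):
    """True if any element ancestor of node satisfies predicate."""
    parent = node.parentNode
    while parent is not None and parent.nodeType == Node.ELEMENT_NODE:
        if predicate(parent):
            return True
        parent = parent.parentNode
    return False

def linkable_text(root):
    """Text nodes under p or li that are not inside a, script, or an ignored element."""
    found = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                inside_target = has_ancestor(child, lambda e: e.tagName in ("p", "li"))
                excluded = has_ancestor(
                    child,
                    lambda e: e.tagName in ("a", "script")
                    or e.hasAttribute("data-alink-ignore"),
                )
                if inside_target and not excluded:
                    found.append(child.data)
            else:
                walk(child)

    walk(root)
    return found

texts = linkable_text(minidom.parseString(SAMPLE).documentElement)
print(texts)
```

With this sample, the link text, the ignored list item and the heading are left out, and only the three remaining text nodes are returned.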
Your XPath expression is invalid. You are missing a / before text(). So a valid expression would be
//*[p or li]/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
But without an XML source file it is impossible to tell if this expression would match your desired node.
I have the following element.
driver = Selenium::WebDriver.for :phantomjs
driver.xpath("/html/body/form/table/tbody/tr[14]/td/table/tbody/tr/td/table/tbody/tr[1]/td[1]/font").text
=> "unique\ntext"
But I don't want to rely on unstable table layout, so I decided to use text() function in xpath like:
driver.xpath("//font[text()='unique\ntext']")
=> nil
But as you see, I couldn't find the element by the text() function. The original text is unique<br>text.
How can I match the <br> tag by using XPath?
There are no id or name attributes that I can use.
The text() test selects any text nodes. In this example there are two such nodes, before and after the <br>. It is not the same as the text method or the string value of the parent node.
One way of selecting what you want could be like this:
driver.xpath("//font[ . ='unique\ntext']")
You might need to add extra newlines before or after the text. Note that this relies on Ruby doing the conversion of \n into an actual newline character before passing the query to the XPath processor, so you need to be careful about getting your quotes right. This compares the string-value of the node, which for an element is the concatenation of all the descendant text nodes, which is what you want.
A better solution might be to use the normalize-space() function here (as long as the unique aspect of the text doesn’t depend on the newlines).
Try:
driver.xpath("//font[normalize-space()='unique text']")
Note that all leading and trailing whitespace in the target text has been removed, and any internal whitespace is changed to a single space character.
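The difference between the text nodes, the string-value and the normalized string-value can be sketched in Python with the standard library. The markup is an assumption (a font element with a newline after the br, which is one way the string-value could contain the \n seen above), and normalize_space is a hand-written approximation of the XPath function, not a library call.

```python
import re
import xml.etree.ElementTree as ET

# Assumed markup: the question only shows the rendered text, not the source.
font = ET.fromstring("<font>unique<br/>\ntext</font>")

def child_text_nodes(elem):
    """Direct child text nodes, the rough counterpart of text()."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]

def string_value(elem):
    """Concatenation of all descendant text nodes, the counterpart of string(.)."""
    return "".join(elem.itertext())

def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse XML whitespace."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

print(child_text_nodes(font))                # two separate text nodes
print(repr(string_value(font)))              # one string containing the newline
print(normalize_space(string_value(font)))   # newline collapsed to a space
```

Neither text node alone equals the full text, which is why a comparison against text() finds nothing while a comparison against . or normalize-space() can succeed.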
I'm trying to select things in a table, and currently have the following expression
//*[@id='row']/tbody/tr[contains(., 'user2')]/td[contains(., 'user2')]
however, this is obviously a problem when there are users entered such as 'user 25', because that also contains 'user 2'. Can someone help me fix what's wrong with the following expressions in which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
I tried normalizing space too, didn't seem to work
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
If it will help here is the html of the page
<table id="row" class="gradientTable">
<td>
user2
</td>
<td>User2</td>
<td>User2</td>
<td>user2@mail.com</td>
<td>2</td>
<td>Student</td></tr>
<tr class="even">
The expression
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
matches any <tr> for which any single descendant text node has the exact content user2 (after space normalization).
Note that this won't match anything in your example html. That example seems to be broken, because there's only one <tr> there, and it has no content that we can see.
Addendum:
You asked, "how exactly does .//text()[] work"?
. selects the context node (which in the above case is a tr element).
//text() selects any text node that is a descendant (of the aforementioned tr element).
[...] gives a predicate that "filters" what the preceding expression selects. So in this case it filters all text nodes that are descendants of the context tr, keeping only those whose space-normalized text content is 'user2'.
All this, as a predicate for tr, means to filter the tr elements, keeping only those for which there is at least one descendant text node whose space-normalized text content is 'user2'.
As Michael Kay pointed out, that may or may not be exactly what you want, depending on whether you want to match a table cell that contains user2 spread across b or i elements.
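The filtering described above can be sketched in Python without an XPath engine. The table below is a hypothetical, repaired stand-in for the posted HTML, and descendant_text_nodes and normalize_space are hand-written helpers that approximate .//text() and normalize-space().

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, repaired version of the table (the posted HTML is incomplete).
TABLE = (
    '<table id="row"><tbody>'
    '<tr><td>\n  user2\n</td><td>Student</td></tr>'
    '<tr><td>user25</td><td>Student</td></tr>'
    '<tr><td>user<b>2</b></td><td>Student</td></tr>'
    '</tbody></table>'
)

def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse XML whitespace."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def descendant_text_nodes(elem):
    """All text nodes below elem in document order, like .//text()."""
    nodes = []
    if elem.text:
        nodes.append(elem.text)
    for child in elem:
        nodes.extend(descendant_text_nodes(child))
        if child.tail:
            nodes.append(child.tail)
    return nodes

rows = ET.fromstring(TABLE).findall("./tbody/tr")

# Keep a row if at least one descendant text node normalizes to exactly "user2".
matching_rows = [
    index
    for index, tr in enumerate(rows)
    if any(normalize_space(t) == "user2" for t in descendant_text_nodes(tr))
]
print(matching_rows)
```

Only the first row is kept. The third row shows the caveat: its cell reads user2 on screen, but the text is split across two text nodes ("user" and "2"), so no single text node matches.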
Addendum 2:
Can someone help me fix what's wrong with the following expressions in
which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
What this expression matches is tr elements that have a direct child (not grandchild) text node whose value is exactly 'user2', e.g. <tr>textNode1<td>...</td>user2</tr>. Since text in tables is usually in a td element instead of directly under a tr, the above expression typically matches nothing.
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
Aside from space normalization, this expression also collapses the generality of the = comparison. In other words... The previous XPath expression asks whether the tr element has any text node child whose value is user2; but this one only asks whether the tr element's first text node child has a value user2.
Why? Because the normalize-space() function takes a single string value as its argument. So if you supply text() as the argument, and there are several text() children, you are supplying a node-set (or sequence in XPath 2.0). The node-set gets converted to a string by taking the string-value of the first node in the node-set.
To get a general comparison back, with normalization, you would use
//*[@id='row']/tbody/tr[text()[normalize-space(.)='user2']]
(The . argument is the default anyway, but I prefer making it explicit.) Again, this will only work with text nodes that are direct children of tr, so you'll probably want a descendant axis in there:
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
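The three predicate forms on direct child text nodes behave differently, which a small Python emulation can show. This is a sketch under assumptions: the tr below is an artificial element with text directly under it, and the "first node only" line models the XPath 1.0 node-set-to-string conversion described above rather than calling an XPath processor.

```python
import re
import xml.etree.ElementTree as ET

# Artificial row with two direct text-node children: "first" and " user2 ".
tr = ET.fromstring("<tr>first<td>x</td> user2 </tr>")

def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse XML whitespace."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def child_text_nodes(elem):
    """Direct child text nodes, the rough counterpart of text()."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]

texts = child_text_nodes(tr)

# tr[text()='user2']: true if ANY child text node equals the string exactly.
any_exact = any(t == "user2" for t in texts)

# tr[normalize-space(text())='user2'] in XPath 1.0: only the FIRST node is used.
first_only = normalize_space(texts[0] if texts else "") == "user2"

# tr[text()[normalize-space(.)='user2']]: ANY child text node, normalized.
any_normalized = any(normalize_space(t) == "user2" for t in texts)

print(any_exact, first_only, any_normalized)
```

Only the last form matches here: the exact comparison is defeated by the surrounding spaces, and the normalize-space(text()) form never looks past the first text node.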
If you are trying to find the table cells (td elements) that contain the exact value "user2", then you want
//*[@id='row']/tbody/tr/td[. = 'user2']
People often misuse "contains" here because they think it has the same meaning as in the English sentence above, "a node contains a value". But that's what "=" does in XPath; the XPath contains() function tests whether the content of the node has a substring equal to "user2".
Don't use text() here. The text() expression selects individual text nodes. But your content isn't necessarily all part of the same text node, for example it might be "user<b>2</b>".
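A small Python emulation may help compare the three tests. It is a sketch only: the cells are invented, and the list comprehensions stand in for td[. = 'user2'], td[contains(., 'user2')] and td[text() = 'user2'] rather than evaluating real XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical cells, including the "user<b>2</b>" case mentioned above.
row = ET.fromstring("<tr><td>user2</td><td>user25</td><td>user<b>2</b></td></tr>")
cells = list(row)

def string_value(elem):
    """Concatenation of all descendant text nodes, the counterpart of string(.)."""
    return "".join(elem.itertext())

def child_text_nodes(elem):
    """Direct child text nodes, the rough counterpart of text()."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]

# td[. = 'user2']: the whole string-value must equal "user2".
equals = [i for i, td in enumerate(cells) if string_value(td) == "user2"]

# td[contains(., 'user2')]: the string-value merely has "user2" as a substring.
contains = [i for i, td in enumerate(cells) if "user2" in string_value(td)]

# td[text() = 'user2']: some single child text node must equal "user2".
text_equals = [
    i for i, td in enumerate(cells)
    if any(t == "user2" for t in child_text_nodes(td))
]

print(equals, contains, text_equals)
```

The = test on . accepts the first and third cells, contains() also lets user25 through, and the text() test misses the cell whose content is split by the b element.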
I was writing an XPath expression, and I had a strange error which I fixed, but what is the difference between the following two XPath expressions?
"//td[starts-with(normalize-space(),'Posted Date:')]"
and
"//td[starts-with(normalize-space(text()),'Posted Date:')]"
Mainly, what will the first XPath expression match? I was getting a lot of strange results. What role does text() play in the matching? Also, is there a difference between normalize-space() and normalize-space(.)?
Well, the real question is: what's the difference between . and text()?
. is the current node. And if you use it where a string is expected (i.e. as the parameter of normalize-space()), the engine automatically converts the node to the string value of the node, which for an element is all the text nodes within the element concatenated. (Because I'm guessing the question is really about elements.)
text() on the other hand only selects text nodes that are the direct children of the current node.
So for example given the XML:
<a>Foo
<b>Bar</b>
lish
</a>
and assuming <a> is your current node, normalize-space(.) will return Foo Bar lish, but normalize-space(text()) will not do what you expect, because text() returns a node-set of two text nodes (Foo and lish), while normalize-space() takes a single string: XPath 1.0 silently uses only the first node, and XPath 2.0 raises a type error.
To cut a long story short, if you want to normalize all the text within an element, use . (the current node). If you want to select a specific text node, use text(), but always remember that despite its name, text() returns a node-set; XPath 1.0 converts it to a string by taking only its first node, and XPath 2.0 rejects it if it holds more than one node.
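The example can be checked with a short Python sketch. It uses the standard library rather than an XPath processor, so normalize_space is a hand-written approximation and the "first node only" line models the XPath 1.0 conversion rule; an XPath 2.0 processor would instead report a type error for a sequence of two nodes.

```python
import re
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>Foo\n<b>Bar</b>\nlish\n</a>")

def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse XML whitespace."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

# normalize-space(.): the string-value covers every descendant text node.
whole = normalize_space("".join(a.itertext()))

# text(): only the direct child text nodes, here two of them.
child_texts = [t for t in [a.text] + [child.tail for child in a] if t]

# normalize-space(text()) under XPath 1.0 rules: the first node only.
first_only = normalize_space(child_texts[0])

print(whole)
print(len(child_texts))
print(first_only)
```

The whole-element form yields Foo Bar lish, while the first-node form yields just Foo, which would explain the strange results when the interesting text is not in the first text node.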
I have such content of html file:
<a class="bf" title="Link to book" href="/book/229920/">book name</a>
Help me to construct xpath expression to get link text (book name).
I try to use /a, but expression evaluates without results.
If the context is the entire document you should probably use // instead of /. Also you may (not sure about that) need to get down one more level to retrieve the text.
I think it should look like this
//a/text()
EDIT: As Tomalak pointed out it's text() not text
Have you tried
//a
?
More specific is better:
//a[#class='bf' and starts-with(#href, '/book/')]
Note that this selects the <a> element. In your host environment it's easy to extract the text value of that node via standard DOM methods (like the .textContent property).
To select the actual text node, see the other answers in this thread.
It depends also on the rest of your document. If you use // in the beginning all the matching nodes will be returned, which might be too many results in case you have other links in your document.
Apart from that a possible xpath expression is //a/text().
The /a you tried only returns the a-tag itself, if it is the root element. To get the link text you need to append the /text() part.
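For a quick experiment, Python's standard library can run a limited XPath subset. The sketch below assumes the link sits inside some wrapper element (the div and the second link are invented), uses the supported [@class='bf'] predicate, and applies the href prefix check in Python because ElementTree does not implement starts-with().

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper around the link from the question, plus an unrelated link.
page = ET.fromstring(
    '<div>'
    '<a class="bf" title="Link to book" href="/book/229920/">book name</a>'
    '<a href="/other">other</a>'
    '</div>'
)

# .//a[@class='bf'] is within ElementTree's XPath subset; the prefix test is done in Python.
book_links = [
    a for a in page.findall(".//a[@class='bf']")
    if a.get("href", "").startswith("/book/")
]

# .text holds the text before the first child element, which here is the whole link text.
names = [a.text for a in book_links]
print(names)
```

A plain .//a would return both links, which illustrates why the more specific selector is preferable when the document contains other anchors.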