How are these two XPath expressions different? - xpath

I'm parsing a website using XPath. I've got two queries, one which finds the node I'm looking for:
//td[.//text()[contains(., "Date Filed:")]]
And one which doesn't:
//td[contains(.//text(), "Date Filed:")]
I don't understand how these are different. I'd read them both to mean, "Find td nodes which have a descendant text node containing Date Filed."
Can anybody explain how these are different?
Here's the HTML, though I don't think it's relevant to the question:
<td width="40%" valign="top">
<br><br><br><br><br>
<b>Date Filed:</b> 11/13/2008<br>
<b>Jury Demand: </b> No<br><br>
<br><b>Date Terminated: </b><br>
<br><b>Date Reopened: </b><br>
<br><b>Does this action raise an issue of constitutionality?: </b>Y<br>
</td>
(Don't look at me that way. I didn't make the website, the U.S. Gov't did.)

That is how string conversion works in XPath:
In the second query, contains(.//text(), "Date Filed:"), you call the contains() function. It accepts two arguments of type string, but your first argument, .//text(), is a node-set, so the string() function is called internally to convert the node-set to a string. In this case string(.//text()) returns only the string-value of the first text node. If you change your second query to //td[contains(., "Date Filed:")], it will select the wanted td.

In XPath 1.0, if you supply a node-set to a function like contains() that expects a string, it uses the string-value of the first node in the node-set (in document order).
In XPath 2.0 and later versions, if you supply a node-set to a function like contains() that expects a string, the node-set is atomized, and if the result contains more than one string (which will normally be the case when more than one node is selected), then you get a type error XPTY0004.
When you ask questions about XSLT or XPath on StackOverflow, please always say which version you are talking about.
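The difference can be checked with a small script. This is only a minimal sketch, assuming Python with the third-party lxml library (which implements XPath 1.0) and a trimmed-down, well-formed stand-in for the table cell in the question; the variable names are illustrative.

```python
from lxml import etree

# Trimmed-down, well-formed stand-in for the cell in the question.
# The first text node inside <td> is just the newline before <b>.
doc = etree.fromstring(
    "<table><tr><td>\n<b>Date Filed:</b> 11/13/2008</td></tr></table>"
)

# Predicate applied to each descendant text node in turn.
per_node = doc.xpath('//td[.//text()[contains(., "Date Filed:")]]')

# Node-set passed to contains(): only the first text node is used.
first_only = doc.xpath('//td[contains(.//text(), "Date Filed:")]')

# String-value of the whole td (all descendant text concatenated).
whole_cell = doc.xpath('//td[contains(., "Date Filed:")]')

# What contains() actually receives in the second query.
first_text = doc.xpath("string(//td//text())")

print(len(per_node), len(first_only), len(whole_cell), repr(first_text))
```

Under these assumptions the first and third queries each select the td, while the second selects nothing, because the first text node is only whitespace.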

Related

Possible to run two completely different x-path

Can anyone please help me here ?
I want to run two xpath together and store the value, I am not sure if it is possible.
My one xpath is fetching City and second is state
//div[(text()='city')]/following-sibling::div
//div[contains(text(),'state')]/following-sibling::div
As the XPaths show, the city and state names are in the div that follows the "city" and "state" labels. I want to run both and capture the output as a string.
On a side note: both XPaths work fine for me.
<div>
<div>City</div>
<div>London</div>
</div>
<!-- In between: some other elements like p, section, other divs -->
<div>
<div>state</div>
<div>England</div>
</div>
It sounds like you want to convert the results of the two XPath expressions to strings, and concatenate those strings. The expression below concatenates them (with a single space between) using the XPath concat function.
concat(
//div[(text()='city')]/following-sibling::div,
' ',
//div[contains(text(),'state')]/following-sibling::div
)
One other thing: note that in your example markup the text of the first div is "City" rather than "city". Make sure the strings in your XPath expression match the text exactly, because the expression 'City'='city' evaluates to false.
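As a rough illustration of both points, here is a sketch assuming Python with the third-party lxml library and a well-formed stand-in for the markup in the question (the root and p elements are filler added for the example):

```python
from lxml import etree

# Well-formed stand-in for the markup in the question (<root> and <p> are filler).
doc = etree.fromstring(
    "<root>"
    "<div><div>City</div><div>London</div></div>"
    "<p>other content</p>"
    "<div><div>state</div><div>England</div></div>"
    "</root>"
)

state = "//div[contains(text(),'state')]/following-sibling::div"

# 'City' matches the label text exactly.
matched = doc.xpath(
    "concat(//div[text()='City']/following-sibling::div, ' ', " + state + ")"
)

# 'city' matches nothing; the empty node-set converts to an empty string.
mismatched = doc.xpath(
    "concat(//div[text()='city']/following-sibling::div, ' ', " + state + ")"
)

print(repr(matched), repr(mismatched))
```

With the wrong case the expression still evaluates without error, but the city part of the result is silently empty.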

Xpath to strip text using substring-after

I have the following which is the second span in html with the class of 'ProductListOurRef':
<span class="ProductListOurRef">Product Code: 60076</span>
Ive tried the following Xpath:
(//span[@class="ProductListOurRef"])[2]
But that returns 'Product Code: 60076'. I need to use XPath to strip the 'Product Code: ' prefix so that the result is just '60076'.
I believe 'substring-after' should do it, but I don't know how to write it.
If you are using XPath 1.0, then the result of an XPath expression must be either a node-set, a single string, a single number, or a single boolean.
As shown in comments on the question, you can write a query using substring-after(), whose result is a string.
However, some applications expect the result of an XPath expression always to be a node-set, and it looks as if you are stuck with such an application. Because you can't construct new nodes in XPath (you can only select nodes that are already present in the input), there is no way around this.
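For a host that does accept string results, the substring-after() query might look like the sketch below. It assumes Python with the third-party lxml library; the wrapper div and the first span (with the made-up code 60001) are invented so that the span from the question is the second match.

```python
from lxml import etree

# Hypothetical sample: the first span is invented for illustration.
doc = etree.fromstring(
    "<div>"
    '<span class="ProductListOurRef">Product Code: 60001</span>'
    '<span class="ProductListOurRef">Product Code: 60076</span>'
    "</div>"
)

# Selecting the node gives the full text of the span.
nodes = doc.xpath('(//span[@class="ProductListOurRef"])[2]')

# Wrapping the same selection in substring-after() yields a string, not a node.
code = doc.xpath(
    'substring-after((//span[@class="ProductListOurRef"])[2], "Product Code: ")'
)

print(nodes[0].text, "->", code)
```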

Contains(.,'text') function for matching text

I'm trying to select things in a table, and currently have the following expression
//*[@id='row']/tbody/tr[contains(., 'user2')]/td[contains(., 'user2')]
however, this is obviously a problem when there are users entered such as 'user 25', because that also contains 'user 2'. Can someone help me fix what's wrong with the following expressions in which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
I tried normalizing space too, didn't seem to work
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
If it will help here is the html of the page
<table id="row" class="gradientTable">
<td>
user2
</td>
<td>User2</td>
<td>User2</td>
<td>user2@mail.com</td>
<td>2</td>
<td>Student</td></tr>
<tr class="even">
The expression
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
matches any <tr> for which any single descendant text node has the exact content user2 (after space normalization).
Note that this won't match anything in your example html. That example seems to be broken, because there's only one <tr> there, and it has no content that we can see.
Addendum:
You asked, "how exactly does .//text()[] work"?
. selects the context node (which in the above case is a tr element).
//text() selects any text node that is a descendant (of the aforementioned tr element).
[...] gives a predicate that "filters" what the preceding expression selects. So in this case it filters all text nodes that are descendants of the context tr, keeping only those whose space-normalized text content is 'user2'.
All this, as a predicate for tr, means to filter the tr elements, keeping only those for which there is at least one descendant text node whose space-normalized text content is 'user2'.
As Michael Kay pointed out, that may or may not be exactly what you want, depending on whether you want to match a table cell that contains user2 spread across b or i elements.
Addendum 2:
Can someone help me fix what's wrong with the following expressions in
which I tried to match the text values exactly? (just the row for now)
//*[@id='row']/tbody/tr[text()='user2']
What this expression matches is tr elements that have a direct child (not grandchild) text node whose value is exactly 'user2', e.g. <tr>textNode1<td>...</td>user2</tr>. Since text in tables is usually in a td element instead of directly under a tr, the above expression typically matches nothing.
//*[@id='row']/tbody/tr[normalize-space(text())='user2']
Aside from space normalization, this expression also collapses the generality of the = comparison. In other words... The previous XPath expression asks whether the tr element has any text node child whose value is user2; but this one only asks whether the tr element's first text node child has a value user2.
Why? Because the normalize-space() function takes a single string value as its argument. So if you supply text() as the argument, and there are several text() children, you are supplying a node-set (or sequence in XPath 2.0). The node-set gets converted to a string by taking the string-value of the first node in the node-set.
To get a general comparison back, with normalization, you would use
//*[@id='row']/tbody/tr[text()[normalize-space(.)='user2']]
(The . argument is the default anyway, but I prefer making it explicit.) Again, this will only work with text nodes that are direct children of tr, so you'll probably want a descendant axis in there:
//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]
If you are trying to find the table cell (td) elements that contain the exact value "user2", then you want
//*[@id='row']/tbody/tr/td[. = 'user2']
People often misuse "contains" here because they think it has the same meaning as in the English sentence above, "a node contains a value". But that's what "=" does in XPath; the XPath contains() function tests whether the content of the node has a substring equal to "user2".
Don't use text() here. The text() expression selects individual text nodes. But your content isn't necessarily all part of the same text node, for example it might be "user<b>2</b>".
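The differences between these approaches can be compared side by side. The following is a sketch, assuming Python with the third-party lxml library and a made-up, well-formed table. Because the first cell has surrounding whitespace, as in the question's HTML, the cell-level comparison here wraps the context node in normalize-space(), which is an adjustment for this sample rather than part of the answers above.

```python
from lxml import etree

# Made-up, well-formed table: one padded "user2" cell, one "user25" cell,
# and one cell whose value is split across a <b> element.
doc = etree.fromstring(
    '<table id="row"><tbody>'
    "<tr><td>\n  user2\n</td><td>user2@mail.com</td></tr>"
    "<tr><td>user25</td></tr>"
    "<tr><td>user<b>2</b></td></tr>"
    "</tbody></table>"
)

# Substring test: also matches the "user25" row.
substring_rows = doc.xpath("//*[@id='row']/tbody/tr[contains(., 'user2')]")

# Exact match on a single descendant text node: misses "user<b>2</b>".
text_node_rows = doc.xpath(
    "//*[@id='row']/tbody/tr[.//text()[normalize-space(.)='user2']]"
)

# Exact match on the cell's whole string-value, after space normalization.
cells = doc.xpath("//*[@id='row']/tbody/tr/td[normalize-space(.)='user2']")

print(len(substring_rows), len(text_node_rows), len(cells))
```

On this sample the three queries select 3 rows, 1 row, and 2 cells respectively.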

multiple string() results for an xpath?

string()
works great on a certain webpage I am trying to extract text from.
http://www.bing.com/search?q=lemons&first=111&FORM=PERE
has similar structure. For bing, the xpath I have tried is
string(//h3/a)
which works great to get the search results, even with strong tags etc, but only returns the first result. Is there something like strings(), so I can get the full text of each
//h3/a
result?
Is there something like strings(), so I can get the full text of each
//h3/a
result?
No, not in XPath 1.0.
From the W3C XPath 1.0 Specification (the only normative document about XPath 1.0):
"Function: string string(object?)
The string function converts an object to a string as follows:
A node-set is converted to a string by returning the string-value of
the node in the node-set that is first in document order."
So, if you only have an XPath 1.0 engine available, you need to select the node-set of all //h3/a elements and then, in the programming language that hosts XPath, iterate over the nodes and get the string value of each one separately.
In XPath 2.0 use:
//h3/a/string()
The result of evaluating this XPath 2.0 expression is a sequence of strings, each of which is the string value of one of the //h3/a elements.
The MSDN documentation of string remarks that:
The string() function converts a node-set to a string by returning the string value of the first node in the node-set, which in some instances may yield unexpected results.
This sounds like what you are experiencing. Why are you using string() at all?
Use //h3/a/text()
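The host-language loop described for XPath 1.0 could look like the following sketch. It assumes Python with the third-party lxml library (an XPath 1.0 engine) and invented markup with the same h3/a/strong shape as a search results page.

```python
from lxml import etree

# Invented markup with the same h3/a/strong shape as a results page.
doc = etree.fromstring(
    "<div>"
    '<h3><a href="#1">Fresh <strong>lemons</strong></a></h3>'
    '<h3><a href="#2">Lemon <strong>recipes</strong> online</a></h3>'
    "</div>"
)

# string() on a node-set: only the first node in document order is used.
first_only = doc.xpath("string(//h3/a)")

# Select the nodes, then take each node's string-value in the host language.
all_strings = [a.xpath("string(.)") for a in doc.xpath("//h3/a")]

# text() selects only direct text-node children, so text inside <strong> is lost.
direct_text = doc.xpath("//h3/a/text()")

print(first_only, all_strings, direct_text)
```

Note that on this sample //h3/a/text() returns fragments rather than the full text of each link, so it is only a substitute for string() when the links contain no nested markup.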

How to use the "translate" Xpath function on a node-set

I have an XML document that contains items with dashes I'd like to strip
e.g.
<xmlDoc>
<items>
<item>a-b-c</item>
<item>c-d-e</item>
</items>
</xmlDoc>
I know I can find-replace a single item using this xpath
/xmldoc/items/item[1]/translate(text(),'-','')
Which will return
"abc"
however, how do I do this for the entire set?
This doesn't work
/xmldoc/items/item/translate(text(),'-','')
Nor this
translate(/xmldoc/items/item/text(),'-','')
Is there a way at all to achieve that?
I know I can find-replace a single
item using this xpath
/xmldoc/items/item[1]/translate(text(),'-','')
Which will return
"abc"
however, how do I do this for the
entire set?
This cannot be done with a single XPath 1.0 expression.
Use the following XPath 2.0 expression to produce a sequence of strings, each being the result of the application of the translate() function on the string value of the corresponding node:
/xmlDoc/items/item/translate(.,'-', '')
The translate function accepts a string as input, not a node-set. This means that writing something like:
"translate(/xmlDoc/items/item/text(),'-','')"
or
"translate(/xmlDoc/items/item,'-','')"
will result in a function call on the first node only (item[1]).
In XPath 1.0 I think you have no option other than doing something ugly like:
"concat(translate(/xmlDoc/items/item,'-',''),
translate(/xmlDoc/items/item[2],'-',''))"
This is impractical for a long list of items, and it returns just a single string.
In XPath 2.0 this can be solved nicely using for expressions:
"for $item in /xmlDoc/items/item
return replace($item,'-','')"
Which returns a sequence type:
abc cde
PS: Do not confuse function calls with location paths. They are different kinds of expressions, and in XPath 1.0 a function call cannot be used as a step in a location path.
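With an XPath 1.0 engine, another way around the limitation is to select the nodes first and apply translate() to each one from the host language. A minimal sketch, assuming Python with the third-party lxml library and the (corrected) sample document from the question:

```python
from lxml import etree

doc = etree.fromstring(
    "<xmlDoc><items><item>a-b-c</item><item>c-d-e</item></items></xmlDoc>"
)

# Passing the whole node-set: only the first item is converted to a string.
first_only = doc.xpath("translate(/xmlDoc/items/item, '-', '')")

# Select the nodes, then evaluate translate() with each item as context node.
stripped = [
    item.xpath("translate(., '-', '')")
    for item in doc.xpath("/xmlDoc/items/item")
]

print(first_only, stripped)
```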
Here is yet another example, run in Chrome developer tools in preparation for a Selenium test.
$x("//table[@id='sometable_table']//tr[1=1 and ./td[2=2 and position()=2 and .//*[translate(text(), ',', '') ='1001'] ] ]/td[position()=2]")
Essentially, the table sometable_table has a column containing numbers that appear localized. For example, 1001 would appear as 1,001. With the above you have a somewhat nasty XPath expression.
First you select all table rows. Then you focus on the td at position 2 of each row. Then you go deeper into the contents of that td until you find any node whose text, after string replacement, is 1001. Finally you ask for the td at position 2 to be returned.
But since all your main filters are at the table row level, you could be doing additional filters at table data columns at other positions as well, if you need to find the appropriate table row that has content (A) on a cell column and content (B) on a different column.
NOTE:
It was actually quite nasty to write this, because intuitively we all google for "XPath replace string". So I was getting frustrated trying to use the XPath replace function until I realized Chrome supports XPath 1.0. In XPath 1.0 the available string functions are different from those in XPath 2.0; you need to use the translate function.
See reference:
http://www.edankert.com/xpathfunctions.html
