multiple string() results for an xpath? - xpath

string()
works great on a certain webpage I am trying to extract text from.
http://www.bing.com/search?q=lemons&first=111&FORM=PERE
has similar structure. For bing, the xpath I have tried is
string(//h3/a)
which works great to get the search results, even with strong tags etc, but only returns the first result. Is there something like strings(), so I can get the full text of each
//h3/a
result?

Is there something like strings(), so I can get the full text of each
//h3/a
result?
No, not in XPath 1.0.
From the W3C XPath 1.0 Specification (the only normative document about XPath 1.0):
"Function: string string(object?)
The string function converts an object to a string as follows:
A node-set is converted to a string by returning the string-value of
the node in the node-set that is first in document order."
So, if you only have an XPath 1.0 engine available, you need to select the node-set of all //h3/a elements and then, in the programming language that hosts XPath, iterate over the nodes and get the string value of each one separately.
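The host-language iteration described above could look roughly like the following Python sketch. It uses only the standard library's xml.etree.ElementTree, and it assumes the page has already been converted to well-formed XML (real search-result HTML usually is not). The SAMPLE markup and the link_texts name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a results page (not Bing's real markup).
SAMPLE = """<html><body>
<h3><a href="#1">Fresh <strong>lemons</strong> online</a></h3>
<h3><a href="#2">All about <strong>lemons</strong></a></h3>
</body></html>"""


def link_texts(markup):
    """Return the string-value of every h3/a element, one string per element."""
    root = ET.fromstring(markup)
    # findall() plays the role of selecting the //h3/a node-set;
    # joining itertext() mirrors what string() computes for a single node.
    return ["".join(a.itertext()) for a in root.findall(".//h3/a")]


print(link_texts(SAMPLE))
```

Each list entry includes the text inside nested tags such as strong, which is the behaviour string() gives for one node.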
In XPath 2.0 use:
//h3/a/string()
The result of evaluating this XPath 2.0 expression is a sequence of strings, each of which is the string value of one of the //h3/a elements.

The MSDN documentation of string remarks that:
The string() function converts a node-set to a string by returning the string value of the first node in the node-set, which in some instances may yield unexpected results.
This sounds like what you are experiencing. Why are you using string() at all?
Use //h3/a/text()

Related

XDMP-REGEX: (err:FORX0002) - String transformation with Regular expressions

I am working on an XQuery requirement to identify the XML tag name() in an XML document using a regex. Later, I will transform the data. It searches the entire document, and if I find a match, I do a string replace using XQuery/XPath.
Here is some sample code showing what I am trying to do.
let $full-doc := fn:doc($uri)
if(fn:matches($full-doc,"<Hyperlink\b[^\>]*?>([A-Z][a-z]{2} [0-3]?[0-9]
[12][890][0-9]{2})</Hyperlink>"))
then $full-doc
else "regex is not working"
I am getting the following error.
regex-match :
[1.0-ml] XDMP-REGEX: (err:FORX0002) fn:matches(fn:doc("44215.xml"), "
<Hyperlink\b[^\>]*?>([A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2}...") -
- Invalid regular expression
Could someone please explain why my regex is not working?
Looking at your requirement:
I am working on an XQuery requirement to identify the XML tag name() in an XML document using a regex.
You are going about this entirely the wrong way. XQuery doesn't see the lexical XML, it sees a tree of nodes. To find the name of an element, use an XPath expression to find the element, then use the name() function to get its name.
If you want to find an element whose name matches a regex, use //*[matches(name(), $regex)]
The word boundary code \b is not supported in XQuery (see https://www.w3.org/TR/xpath-functions-31/#regex-syntax).
But I guess you are looking for Hyperlink elements, not for a <Hyperlink> substring, so you should use a path expression:
let $doc := fn:doc($uri)
where $doc//Hyperlink[matches(., '([A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2})')]
return $doc
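The same idea, working on the node tree instead of the lexical XML, can be sketched outside XQuery too. The Python example below is only an illustration using the standard library. The sample document, the Note element and the function name are assumptions. The date pattern is the one from the question with the tag delimiters and \b removed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the Hyperlink name comes from the question.
SAMPLE = """<doc>
<Hyperlink href="a">Jan 12 1998</Hyperlink>
<Hyperlink href="b">see appendix</Hyperlink>
<Note>Feb 3 2001</Note>
</doc>"""

# Date pattern from the question, applied to element content only.
DATE = re.compile(r"[A-Z][a-z]{2} [0-3]?[0-9] [12][890][0-9]{2}")


def dated_hyperlinks(markup):
    """Return the Hyperlink elements whose whole text content is a date."""
    root = ET.fromstring(markup)
    # Select elements by name from the tree, then test their string value.
    return [
        el for el in root.iter("Hyperlink")
        if DATE.fullmatch("".join(el.itertext()))
    ]


print([el.get("href") for el in dated_hyperlinks(SAMPLE)])
```

Because the element is found by name, no regex has to describe angle brackets or attribute syntax.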

Xpath to strip text using substring-after

I have the following which is the second span in html with the class of 'ProductListOurRef':
<span class="ProductListOurRef">Product Code: 60076</span>
I've tried the following XPath:
(//span[@class="ProductListOurRef"])[2]
But that returns 'Product Code: 60076'. I need to use XPath to strip the 'Product Code: ' prefix so that the result is just '60076'.
I believe 'substring-after' should do it, but I don't know how to write it.
If you are using XPath 1.0, then the result of an XPath expression must be either a node-set, a single string, a single number, or a single boolean.
As shown in comments on the question, you can write a query using substring-after(), whose result is a string.
However, some applications expect the result of an XPath expression always to be a node-set, and it looks as if you are stuck with such an application. Because you can't construct new nodes in XPath (you can only select nodes that are already present in the input), there is no way around this.
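If the host application does let you post-process the selected node, the stripping can be done there instead. The Python sketch below is one possible approach, not something taken from the question. The SAMPLE markup, the first product code and the product_code function are invented for illustration, and the input is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class name and the second span come from the question.
SAMPLE = """<div>
<span class="ProductListOurRef">Product Code: 60075</span>
<span class="ProductListOurRef">Product Code: 60076</span>
</div>"""


def product_code(markup, index=2, prefix="Product Code: "):
    """Return the text after `prefix` in the index-th matching span (1-based)."""
    root = ET.fromstring(markup)
    spans = root.findall(".//span[@class='ProductListOurRef']")
    text = "".join(spans[index - 1].itertext())
    # str.partition returns (before, separator, after); "after" corresponds to
    # what substring-after() yields, and is "" when the prefix is absent.
    return text.partition(prefix)[2]


print(product_code(SAMPLE))
```

In an engine that accepts string results, the equivalent XPath 1.0 expression would be substring-after((//span[@class="ProductListOurRef"])[2], 'Product Code: ').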

How are these two XPath expressions different?

I'm parsing a website using XPath. I've got two queries, one which finds the node I'm looking for:
//td[.//text()[contains(., "Date Filed:")]]
And one which doesn't:
//td[contains(.//text(), "Date Filed:")]
I don't understand how these are different. I'd read them both to mean, "Find td nodes which have a descendant text node containing Date Filed."
Can anybody explain how these are different?
Here's the HTML, though I don't think it's relevant to the question:
<td width="40%" valign="top">
<br><br><br><br><br>
<b>Date Filed:</b> 11/13/2008<br>
<b>Jury Demand: </b> No<br><br>
<br><b>Date Terminated: </b><br>
<br><b>Date Reopened: </b><br>
<br><b>Does this action raise an issue of constitutionality?: </b>Y<br>
</td>
(Don't look at me that way. I didn't make the website, the U.S. Gov't did.)
That is how string conversion works in XPath:
In the second query, contains(.//text(), "Date Filed:"), you call the contains function. It accepts two arguments of type string, but your first argument, .//text(), is a node-set, which means the string function is called internally to convert the node-set to a string. In this case string(.//text()) returns only the value of the first text node. If you change your second query to //td[contains(., "Date Filed:")], it will select the wanted td.
In XPath 1.0, if you supply a node-set to a function like contains() that expects a string, it uses the string-value of the first node in the node-set (in document order).
In XPath 2.0 and later versions, if you supply a node-set to a function like contains() that expects a string, the node-set is atomized, and if the result contains more than one string (which will normally be the case when more than one node is selected), then you get a type error XPTY0004.
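The XPath 1.0 behaviour can be imitated in a few lines of Python to show why the two queries differ. This is only an emulation with the standard library, not an XPath engine. The markup is a trimmed copy of the question's cell with br written as a self-closing tag so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the question's cell (<br> written as <br/>).
SAMPLE = """<td width="40%" valign="top">
<br/><br/>
<b>Date Filed:</b> 11/13/2008<br/>
<b>Jury Demand: </b> No<br/>
</td>"""

td = ET.fromstring(SAMPLE)
text_nodes = list(td.itertext())  # descendant text nodes, in document order

# contains(.//text(), "Date Filed:"): only the first text node is examined,
# and here that node is just the newline before the first <br/>.
first_only = "Date Filed:" in text_nodes[0]

# .//text()[contains(., "Date Filed:")]: every text node is tested in turn.
any_node = any("Date Filed:" in t for t in text_nodes)

# contains(., "Date Filed:"): the whole string-value of td is tested.
whole_value = "Date Filed:" in "".join(text_nodes)

print(first_only, any_node, whole_value)
```

The first test fails because the leading text node is whitespace only, which matches what the question observed.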
When you ask questions about XSLT or XPath on StackOverflow, please always say which version you are talking about.

How to use the "translate" Xpath function on a node-set

I have an XML document that contains items with dashes I'd like to strip
e.g.
<xmlDoc>
<items>
<item>a-b-c</item>
<item>c-d-e</item>
</items>
</xmlDoc>
I know I can find-replace a single item using this xpath
/xmlDoc/items/item[1]/translate(text(),'-','')
Which will return
"abc"
however, how do I do this for the entire set?
This doesn't work
/xmlDoc/items/item/translate(text(),'-','')
Nor this
translate(/xmlDoc/items/item/text(),'-','')
Is there a way at all to achieve that?
I know I can find-replace a single
item using this xpath
/xmlDoc/items/item[1]/translate(text(),'-','')
Which will return
"abc"
however, how do I do this for the
entire set?
This cannot be done with a single XPath 1.0 expression.
Use the following XPath 2.0 expression to produce a sequence of strings, each being the result of the application of the translate() function on the string value of the corresponding node:
/xmlDoc/items/item/translate(.,'-', '')
The translate function takes a string as input, not a node-set. This means that writing something like:
"translate(/xmlDoc/items/item/text(),'-','')"
or
"translate(/xmlDoc/items/item,'-','')"
will result in a function call on the first node only (item[1]).
In XPath 1.0 I think you have no option other than doing something ugly like:
"concat(translate(/xmlDoc/items/item,'-',''),
translate(/xmlDoc/items/item[2],'-',''))"
This is prohibitive for a long list of items, and it returns just a single string.
In XPath 2.0 this can be solved nicely using for expressions:
"for $item in /xmlDoc/items/item
return replace($item,'-','')"
Which returns a sequence type:
abc cde
PS: Do not confuse function calls with location paths. They are different kinds of expressions, and in XPath 1.0 they cannot be mixed.
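When only XPath 1.0 is available, the per-item work can be moved into the host language. The Python sketch below shows one way to do that with the standard library. The function name is made up, and the sample is the question's document with the items element properly closed.

```python
import xml.etree.ElementTree as ET

# The question's document, with the closing </items> tag corrected.
SAMPLE = """<xmlDoc>
<items>
<item>a-b-c</item>
<item>c-d-e</item>
</items>
</xmlDoc>"""


def items_without_dashes(markup):
    """Return each item's string value with every '-' removed."""
    root = ET.fromstring(markup)
    # root is <xmlDoc>, so the relative path items/item mirrors /xmlDoc/items/item.
    return [
        "".join(item.itertext()).replace("-", "")
        for item in root.findall("items/item")
    ]


print(items_without_dashes(SAMPLE))
```

The result is a list with one string per item, comparable to the sequence the XPath 2.0 expressions return.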
Here is yet another example, run in the Chrome developer tools in preparation for a Selenium test.
$x("//table[@id='sometable_table']//tr[1=1 and ./td[2=2 and position()=2 and .//*[translate(text(), ',', '') ='1001'] ] ]/td[position()=2]")
Essentially, the table sometable_table has a column containing numbers that appear localized. For example, 1001 would appear as 1,001. With the above you have a somewhat nasty XPath expression.
First you select all table rows. Then, for each row, you focus on the td at position 2 and descend into its contents until you find any node whose text, after the comma is stripped, is 1001. Finally, you ask for the td at position 2 to be returned.
Since all the main filters are at the table-row level, you could add further filters on td cells at other positions, for example to find the row that has content (A) in one column and content (B) in a different column.
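The row-level filtering described above can also be written in a host language. The following Python sketch is a loose equivalent rather than a translation of the XPath: it compares the whole text of the second cell instead of testing each descendant node. The table contents and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical table; only the id comes from the expression above.
SAMPLE = """<table id="sometable_table">
<tr><td>alpha</td><td><span>1,001</span></td></tr>
<tr><td>beta</td><td><span>2,500</span></td></tr>
</table>"""


def second_cells_matching(markup, wanted="1001"):
    """Return the second td of each row whose text equals `wanted` once commas are removed."""
    table = ET.fromstring(markup)
    hits = []
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) < 2:
            continue
        # Same effect as translate(text(), ',', '') for plain digit groups.
        value = "".join(cells[1].itertext()).replace(",", "").strip()
        if value == wanted:
            hits.append(cells[1])
    return hits


print(["".join(td.itertext()) for td in second_cells_matching(SAMPLE)])
```

Further conditions on other columns can be added inside the loop by testing cells[0], cells[2] and so on.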
NOTE:
It was actually quite nasty to write this, because intuitively we all google for "XPath replace string". So I was getting frustrated trying to use the XPath replace function until I realized that Chrome supports only XPath 1.0. The string functions in XPath 1.0 differ from those in XPath 2.0, so you need to use the translate function instead.
See reference:
http://www.edankert.com/xpathfunctions.html

Invalid Token when using XPath

I am making a modification to a web application using XPath, and when executed I get an error message - Invalid token!
This is basically what I am doing:
public xmlNode GetSelection (SelectParams params, xmldocument docment)
{
xpathstring = string.format("Name =\'{0}' Displaytag = \'{1}' Manadatory=\'{2}', params.Name, params.Displaytag, params.Manadatory);
return document.selectsinglenode(xpathstring);
}
As you can see, I am building a string from the attribute values I want to match in my XML document, expecting to get back the XML node that matches my parameters.
What happens instead is that I get an XPathException in Visual Studio saying "invalid token".
I do know that in the XML document the attributes I am matching on have double-quoted values, for example Name="ABC". So I thought the problem could be solved by escaping with "\".
Can anyone help?
Update from comments
In the Xml Document, the tag has
attributes where they are set as
Name="ABC" Displaytag="ATag"
Manadatory="true".
I guess you need:
//*[@Name="ABC"][@Displaytag="ATag"][@Manadatory="true"]
Or
//*[@Name="ABC" and @Displaytag="ATag" and @Manadatory="true"]
Meaning: any element in the whole document having a Name attribute with "ABC" value, a Displaytag attribute with "ATag" value and a Manadatory attribute with "true" value.
The string passed as argument to SelectSingleNode() (BTW, the exact capitalization is important) is something like:
Name ='someName' Displaytag = 'someString' Manadatory='true'
This is very different from a syntactically legal XPath expression.
And the error message just reflects the fact that toxic food has been given to the XPath engine.
Solution: Do read at least a light XPath tutorial and then specify a correct XPath expression.
The string you are constructing is not a valid XPath expression. In fact, it is nothing like XPath at all.
Indeed, even if it were a valid XPath expression, constructing it this way by string concatenation is a very dangerous practice, because of the possibility of injection attacks. But I suspect that advice will fall on stony ground.
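One way to sidestep the injection risk is to avoid building an expression from user input at all and to compare attribute values in the host language. The Python sketch below illustrates that idea with the standard library. It is not the .NET API from the question, and the sample document, the field element name and the select_single function are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the attribute names are the ones from the question.
SAMPLE = """<fields>
<field Name="ABC" Displaytag="ATag" Manadatory="true"/>
<field Name="XYZ" Displaytag="Other" Manadatory="false"/>
</fields>"""


def select_single(root, **wanted):
    """Return the first element whose attributes equal all of `wanted`, else None."""
    for el in root.iter():
        # Values are compared as plain data and never parsed as XPath,
        # so quotes or operators in the input cannot change the query.
        if all(el.get(name) == value for name, value in wanted.items()):
            return el
    return None


root = ET.fromstring(SAMPLE)
match = select_single(root, Name="ABC", Displaytag="ATag", Manadatory="true")
hostile = select_single(root, Name='ABC" or "1"="1', Displaytag="ATag", Manadatory="true")
print(match.get("Name"), hostile)
```

The hostile value simply fails to match, whereas pasted into a concatenated XPath string it could alter the predicate.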
