XPath function on multiple text nodes - xpath

Using XPath 1.0, I want to get a list of text nodes by applying the XPath 'substring' function to every text node.
substring(//p/text(), 10)
gives only the substring of the first text node, whereas
//p/text()
gives all of them. I want all the substrings as a set.
EDIT:
I tried
//p/substring(text(), 10)
but it is reported as an invalid XPath expression.
How can I achieve this?
Thanks in advance

If you want a set of strings as the result of an XPath 1.0 expression, then you're out of luck, because there is no such data type in XPath 1.0: the only collections available are collections of nodes, and you can only select nodes that already exist, you can't create new ones.
With XPath 2.0 this is a piece of cake:
//p/text()/substring(., 10)
So if you possibly can, find yourself an XPath 2.0 processor.
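If only an XPath 1.0 style processor is available, a common workaround is to let XPath select the nodes and apply the string function in the host language. The sketch below uses Python's standard xml.etree.ElementTree (which supports only a small XPath subset); the sample document and the helper name xpath_substring are made up for illustration, and p.text is assumed to stand in for the single text node of each p.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the content is illustrative only.
doc = ET.fromstring(
    "<body>"
    "<p>0123456789first tail</p>"
    "<p>0123456789second tail</p>"
    "</body>"
)

def xpath_substring(text, start):
    """Mimic XPath substring(text, start) for a positive integer start (1-based)."""
    return text[start - 1:]

# Step 1: select the nodes with a path expression.
paragraphs = doc.findall(".//p")

# Step 2: apply the string function to every node in the host language.
tails = [xpath_substring(p.text or "", 10) for p in paragraphs]
print(tails)
```

The result is an ordinary Python list of strings, which is exactly the data type XPath 1.0 itself cannot return.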

Related

How to get multiple occurrences of an element with XPath when using normalize-space and substring-before

I have an element that occurs three times on the page. If I match it with the XPath expression //div[@class='col-md-9 col-xs-12'], I get all three occurrences as expected.
Now I try to rework the matched element on the fly with
substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen'), to get the string before the word "Bewertungen",
normalize-space(//div[@class='col-md-9 col-xs-12']), to clean up redundant whitespace,
normalize-space(substring-before(//div[@class='col-md-9 col-xs-12'] - both actions.
The problem with the last three expressions is that they extract only the first occurrence of the element. It makes no difference whether I add /text() after the match.
I don't understand how adding normalize-space and/or substring-before changes the "main" expression so that it stops recognizing multiple occurrences of the targeted element and returns only the first. Without them it matches everything as it should.
How can I adjust XPath expression no. 3 to get all occurrences of the element?
Example url is https://www.provenexpert.com/de-de/jazzyshirt/
The problem is that both normalize-space() and substring-before() have a required cardinality of 1, meaning they can accept only one occurrence of the element you are trying to normalize or take a substring of. Each of your expressions passes them a sequence of 3 nodes, which these two functions cannot process as a whole. (I probably didn't express the problem properly, but I think this is the general idea.)
In light of that, try the following (a function call as a path step requires XPath 2.0 or later):
//div[@class='col-md-9 col-xs-12']/substring-before(normalize-space(.), 'Bewertung')
Note that in XPath 1.0, functions like substring-after(), if given a set of three nodes as input, ignore all nodes except the first. XPath 2.0 changes this: it gives you an error.
In XPath 3.0 or later you can apply a function to each of the nodes using the simple map operator, "!": //div[condition] ! substring-before(normalize-space(), 'Bewertung'). That returns a sequence of 3 strings. There's no equivalent in XPath 1.0, because there's no data type in XPath 1.0 that can represent a sequence of strings.
In XPath 2.0 you can often achieve the same effect using "/" instead of "!", but it has restrictions.
When asking questions on StackOverflow, please always mention which version of XPath you are using. We tend to assume that if people don't say, they're probably using 1.0, because 1.0 products don't generally advertise their version number.
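For readers limited to XPath 1.0, the per-node mapping can be done outside XPath. The following is a minimal sketch in Python using the standard xml.etree.ElementTree; the sample markup and the helper names normalize_space and substring_before are assumptions for illustration, and Python's str.split() treats slightly more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the three matched div elements.
doc = ET.fromstring(
    "<body>"
    "<div class='col-md-9 col-xs-12'>  Shop   A 12 Bewertungen </div>"
    "<div class='col-md-9 col-xs-12'>Shop B\n 7 Bewertungen</div>"
    "<div class='other'>ignored Bewertungen</div>"
    "</body>"
)

def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse whitespace runs."""
    return " ".join(text.split())

def substring_before(text, marker):
    """Mimic XPath substring-before(): '' when the marker is empty or absent."""
    if not marker:
        return ""
    head, sep, _ = text.partition(marker)
    return head if sep else ""

# Select all matching nodes first, then apply both functions to each one.
divs = doc.findall(".//div[@class='col-md-9 col-xs-12']")
results = [
    substring_before(normalize_space("".join(d.itertext())), "Bewertung")
    for d in divs
]
print(results)
```

Each matched element yields its own string, which mirrors what the "!" operator produces in XPath 3.x.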

Evaluate xpath selector to get text in p- and li-tags

To automatically replace keywords with links based on a list of keyword-link pairs, I need to get text inside paragraphs (p) and list items (li) that is not already linked, not in a script, and not manually excluded. The selector is to be used in Drupal's Alinks module.
I modified the existing XPath selector as follows and would like feedback on whether it is efficient or could be improved:
//*[p or li]//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
The XPath is meant to work with any HTML5 content, including self-closing tags (not well-formed XML). That's the way the module was designed, and it works quite well.
To select text nodes that are descendants of p or li elements but not descendants of a or script elements, you can use this XPath 1.0 expression:
//*[self::p|self::li]
//text()[
not(ancestor::a|ancestor::script|ancestor::*[@data-alink-ignore])
]
Your XPath expression is invalid. You are missing a / before text(). So a valid expression would be
//*[p or li]/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
But without an XML source file it is impossible to tell if this expression would match your desired node.
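To make the intended selection concrete, here is a rough sketch of the same filter written as a plain tree walk in Python. It is not the Alinks module's code: the sample markup and the function name collect_text are invented, and the input is assumed to be well-formed XML because xml.etree.ElementTree cannot parse arbitrary HTML5.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
doc = ET.fromstring(
    "<body>"
    "<p>Buy <a href='x'>linked</a> apples "
    "<span data-alink-ignore='1'>skip</span> now</p>"
    "<ul><li>pears</li></ul>"
    "<div>outside</div>"
    "<script>var p;</script>"
    "</body>"
)

def collect_text(el, inside=False, excluded=False, out=None):
    """Collect text nodes below p/li that have no a, script or
    data-alink-ignore ancestor."""
    if out is None:
        out = []
    inside = inside or el.tag in ("p", "li")
    excluded = (excluded or el.tag in ("a", "script")
                or "data-alink-ignore" in el.attrib)
    keep = inside and not excluded
    if keep and el.text:
        out.append(el.text)
    for child in el:
        collect_text(child, inside, excluded, out)
        # A child's tail text is a text node of el, not of the child.
        if keep and child.tail:
            out.append(child.tail)
    return out

texts = collect_text(doc)
print(texts)
```

The walk visits every node once, so its cost grows linearly with the document size.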

Xpath to strip text using substring-after

I have the following which is the second span in html with the class of 'ProductListOurRef':
<span class="ProductListOurRef">Product Code: 60076</span>
I've tried the following XPath:
(//span[@class="ProductListOurRef"])[2]
But that returns 'Product Code: 60076'. I need to use XPath to strip the 'Product Code: ' prefix so that the result is just '60076'.
I believe 'substring-after' should do it, but I don't know how to write it.
If you are using XPath 1.0, then the result of an XPath expression must be either a node-set, a single string, a single number, or a single boolean.
As shown in comments on the question, you can write a query using substring-after(), whose result is a string.
However, some applications expect the result of an XPath expression always to be a node-set, and it looks as if you are stuck with such an application. Because you can't construct new nodes in XPath (you can only select nodes that are already present in the input), there is no way around this.
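If the application allows any post-processing of the selected node, the prefix can be stripped there instead. The sketch below assumes such a hook exists and uses Python's standard xml.etree.ElementTree; the first span and the helper name substring_after are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the second span's content comes from the question.
doc = ET.fromstring(
    "<div>"
    '<span class="ProductListOurRef">Product Code: 12345</span>'
    '<span class="ProductListOurRef">Product Code: 60076</span>'
    "</div>"
)

def substring_after(text, marker):
    """Mimic XPath substring-after(): '' when the marker is absent."""
    if not marker:
        return text
    _, sep, tail = text.partition(marker)
    return tail if sep else ""

# Select the node-set, take the second node, then strip the prefix in Python.
spans = doc.findall(".//span[@class='ProductListOurRef']")
product_code = substring_after(spans[1].text or "", "Product Code: ")
print(product_code)
```

Where a string result is accepted, the pure XPath 1.0 form would be substring-after((//span[@class="ProductListOurRef"])[2], 'Product Code: ').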

XPath expression

This question regards XPath expressions.
I want to find the average of the length of all URLs in a Web page, that point to a .pdf file.
So far I have constructed the following expression, but it does not work:
sum(string-length(string(//a/@href[contains(., ".pdf")]))) div
count(//a/@href[contains(., ".pdf")])
Any help will be appreciated!
You will need XPath 2.0.
For calculating the sum of the string lengths, you will need to either
build a concatenated string of all @hrefs to pass to string-length($string as xs:string) (which only allows a single string as parameter), but concat(...) only takes an arbitrary number of atomic strings, not a sequence of those; or
apply string-length(...) on every @href as @Navin Rawat proposed, but using arbitrary functions in axis steps is a new feature of XPath 2.0.
If using XPath 2.0, there are functions avg(...) and ends-with(...) which help you in stripping down the expression to
avg(//a/@href[ends-with(., '.pdf')]/string-length())
If you have to stick with XPath 1.0, all you can do is use the expression below to fetch the URLs and calculate the average outside XPath.
Anyway, the subexpression you proposed will also match URLs like http://example.net/myfile.pdf.txt. Only compare the end of the URL:
//a[@href[substring(., string-length(.) - 3) = '.pdf']]/@href
And you missed a path step for the attribute, so you've been trying to average the string length of the link names right now.
Please put something like:
sum(//a/@href[contains(.,'.pdf')]/string-length()) div count(//a/@href[contains(.,'.pdf')])
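For the XPath 1.0 route, computing the average outside XPath might look like the following sketch. It uses Python's standard xml.etree.ElementTree on a made-up, well-formed sample page; the URLs and variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment with two PDF links and two non-matching anchors.
doc = ET.fromstring(
    "<body>"
    "<a href='http://example.net/a.pdf'>A</a>"
    "<a href='http://example.net/longer-name.pdf'>B</a>"
    "<a href='http://example.net/myfile.pdf.txt'>C</a>"
    "<a name='no-href'>D</a>"
    "</body>"
)

# Fetch the href values with a path expression.
hrefs = [a.get("href") for a in doc.findall(".//a[@href]")]

# Keep only URLs that end in .pdf, then average their lengths in Python.
pdf_hrefs = [h for h in hrefs if h.endswith(".pdf")]
average = sum(len(h) for h in pdf_hrefs) / len(pdf_hrefs) if pdf_hrefs else 0.0
print(average)
```

The guard on an empty list avoids a division by zero when the page has no PDF links, a case the XPath div expression would report as NaN.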

How to use the "translate" Xpath function on a node-set

I have an XML document that contains items with dashes I'd like to strip
e.g.
<xmlDoc>
<items>
<item>a-b-c</item>
<item>c-d-e</item>
</items>
</xmlDoc>
I know I can find-replace a single item using this XPath
/xmlDoc/items/item[1]/translate(text(),'-','')
which will return
"abc"
However, how do I do this for the entire set?
This doesn't work
/xmlDoc/items/item/translate(text(),'-','')
Nor this
translate(/xmlDoc/items/item/text(),'-','')
Is there a way at all to achieve that?
This cannot be done with a single XPath 1.0 expression.
Use the following XPath 2.0 expression to produce a sequence of strings, each being the result of the application of the translate() function on the string value of the corresponding node:
/xmlDoc/items/item/translate(.,'-', '')
The translate function takes a string as input, not a node-set. This means that writing something like:
"translate(/xmlDoc/items/item/text(),'-','')"
or
"translate(/xmlDoc/items/item,'-','')"
will result in a function call on the first node only (item[1]).
In XPath 1.0 I think you have no other option than doing something ugly like:
"concat(translate(/xmlDoc/items/item,'-',''),
translate(/xmlDoc/items/item[2],'-',''))"
which is impractical for a huge list of items and returns just a single string.
In XPath 2.0 this can be solved nicely using for expressions:
"for $item in /xmlDoc/items/item
return replace($item,'-','')"
Which returns a sequence type:
abc cde
PS: Do not confuse function calls with location paths. They are different kinds of expressions, and in XPath 1.0 they cannot be mixed.
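When the processor is limited to XPath 1.0, the per-item call can also be made in the host language. This is a minimal sketch using Python's standard library; the helper name xpath_translate is hypothetical and is meant to follow the XPath translate() rules (characters without a counterpart in the third argument are removed, and the first mapping for a character wins).

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<xmlDoc><items>"
    "<item>a-b-c</item>"
    "<item>c-d-e</item>"
    "</items></xmlDoc>"
)

def xpath_translate(text, src, dst):
    """Mimic XPath translate(text, src, dst) with Python's str.translate."""
    table = {}
    for i, ch in enumerate(src):
        if ord(ch) in table:
            continue  # the first occurrence in src determines the mapping
        table[ord(ch)] = dst[i] if i < len(dst) else None  # None deletes
    return text.translate(table)

# Select every item, then translate each string value separately.
items = doc.findall("./items/item")
stripped = [xpath_translate(item.text or "", "-", "") for item in items]
print(stripped)
```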
Here is yet another example, run against Chrome developer tools in preparation for a Selenium test.
$x("//table[@id='sometable_table']//tr[1=1 and ./td[2=2 and position()=2 and .//*[translate(text(), ',', '') ='1001'] ] ]/td[position()=2]")
Essentially, the table sometable_table has a column containing numbers that appear localized. For example, 1001 would appear as 1,001. With the above you have a somewhat nasty XPath expression.
First you select all table rows. Then you focus on the table data cell at position 2 of each row. Then you go deeper into the contents of that cell until you find any node whose text, after string replacement, is 1001. Finally you ask for the table data cell at position 2 to be returned.
But since all your main filters are at the table row level, you could add further filters on table data cells at other positions as well, if you need to find the table row that has content (A) in one column and content (B) in a different column.
NOTE:
It was actually quite nasty to write this, because intuitively we all google for "XPath replace string". So I was getting frustrated trying to use the XPath replace function until I realized Chrome supports only XPath 1.0. In XPath 1.0 the available string functions are different from those in XPath 2.0; you need to use the translate function.
See reference:
http://www.edankert.com/xpathfunctions.html
