XPath expression - xpath

This question regards XPath expressions.
I want to find the average length of all URLs in a web page that point to a .pdf file.
So far I have constructed the following expression, but it does not work:
sum(string-length(string(//a/@href[contains(., ".pdf")]))) div
count(//a/@href[contains(., ".pdf")])
Any help will be appreciated!

You will need XPath 2.0.
To calculate the sum of the string lengths, you would need to either
build a concatenated string of all @href values to pass to string-length($string as xs:string) (which accepts only a single string as its parameter), but concat(...) takes an arbitrary number of atomic strings, not a sequence of them; or
apply string-length(...) to every @href as @Navin Rawat proposed, but using arbitrary functions in axis steps is a new feature of XPath 2.0.
If using XPath 2.0, there are functions avg(...) and ends-with(...) which help you in stripping down the expression to
avg(//a/@href[ends-with(., '.pdf')]/string-length())
If you have to stick with XPath 1.0, all you can do is use the expression below to fetch the URLs and calculate the average outside XPath.
Anyway, the subexpression you proposed will fail at URLs like http://example.net/myfile.pdf.txt. Only compare the end of the URL:
//a[@href[substring(., string-length(.) - 3) = '.pdf']]/@href
And you missed a path step for the attribute, so you've been trying to average the string length of the link names right now.
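For the XPath 1.0 case, "calculate the average outside XPath" could look roughly like the following sketch. It uses Python's standard xml.etree.ElementTree module, assumes the page is well-formed XHTML, and works on a made-up sample document; the function name average_pdf_url_length is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; well-formed XHTML is assumed so the stdlib parser can read it.
PAGE = """
<html><body>
  <a href="http://example.net/a.pdf">A</a>
  <a href="http://example.net/myfile.pdf.txt">B</a>
  <a href="http://example.net/report.pdf">C</a>
  <a name="no-href">D</a>
</body></html>
"""


def average_pdf_url_length(xhtml: str) -> float:
    root = ET.fromstring(xhtml)
    # Rough equivalent of //a/@href: the href of every <a> that has one.
    hrefs = [a.get("href") for a in root.iter("a") if a.get("href") is not None]
    # Only compare the end of the URL, so myfile.pdf.txt is not counted.
    pdf_urls = [h for h in hrefs if h.endswith(".pdf")]
    if not pdf_urls:
        raise ValueError("no .pdf links found")
    return sum(len(u) for u in pdf_urls) / len(pdf_urls)


average = average_pdf_url_length(PAGE)
print(average)
```

Real-world HTML is often not well-formed, in which case an HTML-aware parser would be needed before this step.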

Please put something like:
sum(//a/@href[contains(.,'.pdf')]/string-length()) div count(//a/@href[contains(.,'.pdf')])

Related

How to get multiple occurrences of an element with XPath when using normalize-space and substring-before

I have an element with three occurrences on the page. If I match it with the XPath expression //div[@class='col-md-9 col-xs-12'], I get all three occurrences as expected.
Now I try to rework the matching element on the fly with
substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen'), to get the string before the word "Bewertungen",
normalize-space(//div[@class='col-md-9 col-xs-12']), to clean up redundant whitespace,
normalize-space(substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen')), to do both.
The problem with the last three expressions is that they extract only the first occurrence of the element. It makes no difference whether I add /text() after the matching expression.
I don't understand how adding normalize-space and/or substring-before influences the "main" expression so that it stops recognizing multiple occurrences of the targeted element and returns only the first. Without the addition it matches everything as it should.
How can I adjust XPath expression no. 3 to get all occurrences of the element?
Example url is https://www.provenexpert.com/de-de/jazzyshirt/
The problem is that both normalize-space() and substring-before() have a required cardinality of 1, meaning they can accept only one occurrence of the element you are trying to normalize or take a substring of. Each of your expressions passes a sequence of 3 nodes, which these two functions cannot process. (I probably didn't express the problem properly, but I think this is the general idea.)
In light of that, try:
//div[@class='col-md-9 col-xs-12']/substring-before(normalize-space(.), 'Bewertung')
Note that in XPath 1.0, functions like substring-after(), if given a set of three nodes as input, ignore all nodes except the first. XPath 2.0 changes this: it gives you an error.
In XPath 3.1 you can apply a function to each of the nodes using the apply operator, "!": //div[condition] ! substring-before(normalize-space(), 'Bewertung'). That returns a sequence of 3 strings. There's no equivalent in XPath 1.0, because there's no data type in XPath 1.0 that can represent a sequence of strings.
In XPath 2.0 you can often achieve the same effect using "/" instead of "!", but it has restrictions.
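If only XPath 1.0 is available, one workaround is to select the nodes with XPath and apply the string functions per node in the host language. The sketch below does this in Python with the standard library; the markup is invented for illustration, and the helpers normalize_space and substring_before are hypothetical names that only approximate the XPath functions (Python's split() treats a few more characters as whitespace).

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the real page.
PAGE = """
<html><body>
  <div class="col-md-9 col-xs-12">  4,8   von 5
     120 Bewertungen </div>
  <div class="col-md-9 col-xs-12">Sehr gut  35 Bewertungen</div>
  <div class="other">ignored</div>
</body></html>
"""


def normalize_space(text: str) -> str:
    # Approximates XPath normalize-space(): trim and collapse whitespace runs.
    return " ".join(text.split())


def substring_before(text: str, marker: str) -> str:
    # Approximates XPath substring-before(): empty string when the marker is absent.
    head, sep, _ = text.partition(marker)
    return head if sep else ""


root = ET.fromstring(PAGE)
# The node selection stays in (a subset of) XPath 1.0 ...
divs = root.findall(".//div[@class='col-md-9 col-xs-12']")
# ... and the per-node string processing happens outside XPath.
results = [
    substring_before(normalize_space("".join(d.itertext())), "Bewertung")
    for d in divs
]
print(results)
```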
When asking questions on StackOverflow, please always mention which version of XPath you are using. We tend to assume that if people don't say, they're probably using 1.0, because 1.0 products don't generally advertise their version number.

Is there a short and elegant way to write an XPath 1.0 expression to get all HREF values containing at least one of many search values?

I was just wondering if there is a shorter way of writing an XPath query to find all HREF values containing at least one of many search values?
What I currently have is the following:
//a[contains(@href, 'value1') or contains(@href, 'value2')]
But it seems quite ugly, especially if I were to have more values.
First of all, in many cases you have to live with the "ugliness" or long-windedness of expressions if only XPath 1.0 is at your disposal. Elegance is something introduced with version 2.0, I daresay.
But there might be ways to improve your expression: Is there a regularity to the href attributes you'd like to find? For instance, if it is sufficient as a rule to say that the said href attribute values must start with "value", then the expression could be
//a[starts-with(@href,'value')]
I know that "value1" and "value2" are most probably not your actual attribute values but there might be something else that uniquely identifies the group of a elements you're after. Post your HTML input if this is something you want us to help you with.
Personally, I do not find your expression ugly. There is just one or operator and the expression is quite short and readable. I take "if I were to have more values" to mean that currently there are only two attribute values you are interested in, and that your question is therefore a theoretical one.
If you're using XPath 2 and want exact matches rather than matches that merely contain a search value, you can shorten the expression to
//a[@href = ('value1', 'value2')]
For contains() this syntax wouldn't work as the second argument of contains() is only allowed to be 0 or 1 value.
In XPath 2 you could also use
//a[some $s in ('value1', 'value2') satisfies contains(@href, $s)]
or
//a[matches(@href, "value1|value2")]
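If you are stuck with XPath 1.0 and the list of values grows, the long or chain can at least be generated instead of written by hand. A minimal sketch in Python, assuming the search values contain no single quotes; the function name contains_any_xpath is made up for this example.

```python
from typing import Sequence


def contains_any_xpath(element: str, attribute: str, values: Sequence[str]) -> str:
    """Build an XPath 1.0 expression matching elements whose attribute contains any value."""
    if not values:
        raise ValueError("at least one search value is required")
    tests = []
    for value in values:
        # XPath 1.0 has no escape syntax inside string literals, so this sketch
        # simply refuses values that contain a single quote.
        if "'" in value:
            raise ValueError(f"unsupported quote in {value!r}")
        tests.append(f"contains(@{attribute}, '{value}')")
    return f"//{element}[{' or '.join(tests)}]"


expression = contains_any_xpath("a", "href", ["value1", "value2", "value3"])
print(expression)
```

The generated string is still the verbose XPath 1.0 form; only the writing of it is automated.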

Is it safe to concatenate two XPath 1.0 queries?

If I have two XPath queries where the second one is meant to further drill down the result of the first, can I safely let my script combine them into a single query by...
placing parentheses around the first query,
prefixing the second query with a slash, and then
simply concatenating the two strings?
Context
The concrete use case that sparked this question involves extracting information from XML/XHTML documents according to externally supplied pairs of "CSS selector + attribute name", using XPath behind the scenes.
For example the script may get the following as input:
selector: a#home, a.chapter
attribute: href
It then compiles the selector to an XPath query using the HTML::Selector::XPath Perl module, and the attribute by simply prefixing a @ ... which in this case would yield:
XPath query 1: //a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')]
XPath query 2: @href
And then it repeatedly passes those queries to libxml2's XPath engine to extract the requested information (in this example, a list of URLs) from the XML documents in question.
It works, but I would prefer to combine the two queries into a single one, which would simplify the code for invoking them and reduce the performance overhead:
XPath query: (//a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')])/@href
(note the added parentheses and slash)
But is this safe to do programmatically, for arbitrary input queries?
In general, no, you can't concatenate two arbitrary XPath expressions in this way, especially not in XPath 1.0. It's easy to find counter-examples: in XPath 1.0 you can't even have a union expression on the RHS of '/', so concatenating "/a" and "(b|c)" would fail.
In XPath 2.0, the result will always be syntactically valid, but it may contain type errors, e.g. if the expressions are "count(a)" and "b". The LHS operand of "/" must evaluate to a sequence of nodes.
Sure, this should work. However, you will always have to respect the correct context. If the elements in your example in the first query have no href attribute, you will get an empty result set.
Also, you will have to take care of e.g. a leading slash in front of your second query, so that you don't end up with a descendant-or-self axis step, which might not be what you want. Apart from that, this should always work. The worst that can happen is that it is not logically correct (i.e. you don't get the expected result), but it should always be valid XPath.
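Given the caveats in both answers, a cautious script could restrict the combination to the case described in the question, where the second query is always a single attribute step. The following Python sketch does that; it assumes the first query evaluates to a node-set (as a compiled CSS selector does), and the function name select_attribute and the simplified name pattern are illustrative assumptions rather than a general solution.

```python
import re

# Simplified pattern for an attribute name, optionally with a namespace prefix.
# This is an assumption for the sketch, not the full XML name grammar.
_ATTRIBUTE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*(:[A-Za-z_][A-Za-z0-9_.-]*)?")


def select_attribute(node_query: str, attribute: str) -> str:
    """Combine a node-selecting query with one attribute step: (query)/@name."""
    if not _ATTRIBUTE_NAME.fullmatch(attribute):
        # Rejects leading slashes, parentheses, unions, and anything else
        # that would turn the second part into an arbitrary expression.
        raise ValueError(f"not a plain attribute name: {attribute!r}")
    return f"({node_query})/@{attribute}"


first = "//a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')]"
combined = select_attribute(first, "href")
print(combined)
```

Validating the second part instead of concatenating arbitrary strings avoids the counter-examples mentioned above, at the cost of supporting only attribute extraction.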

XPath different in IE and Firefox. Why?

I used Firebug's Inspect Element to capture the XPath in a webpage, and it gave me something like:
//*[@id="Search_Fields_profile_docno_input"]
I used the Bookmarklets technique in IE to capture the XPath of the same object, and I got something like:
//INPUT[@id='Search_Fields_profile_docno_input']
Notice, the first one does not have INPUT instead has an asterisk (*). Why am I getting different XPath expressions? Does it matter which one I use for my tests like:
Selenium.Click(//*[@id="Search_Fields_profile_docno_input"]);
OR
Selenium.Click(//INPUT[@id='Search_Fields_profile_docno_input']);
//*[@id=...] means the match can be any element, while the second expression tells Selenium to look ONLY for INPUT elements whose id is Search_Fields_profile_docno_input. The second XPath is better for the following reasons:
It takes more time to find the element using *, because the IDs of all elements have to be checked.
If your HTML code is not "well written", there could be other elements with the same id, and this could cause your test to fail.
The first one matches any element with a matching ID, whereas the second one restricts matches to <input> elements. If these were CSS expressions it'd be the difference between #Search_Fields_profile_docno_input and input#Search_Fields_profile_docno_input.
Assuming you only use this ID once in your web page, the two XPaths are effectively equivalent. They'll both match the <input id="Search_Fields_profile_docno_input"> element and no other.
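The equivalence can be checked on a small sample. This sketch uses Python's standard xml.etree.ElementTree module on invented markup; note that this parser is case-sensitive, so the element test is written as lowercase input here, unlike the uppercase INPUT that IE reports.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment in which the ID is used exactly once.
PAGE = """
<html><body>
  <form>
    <input id="Search_Fields_profile_docno_input" type="text"/>
    <input id="other_input" type="text"/>
  </form>
</body></html>
"""

root = ET.fromstring(PAGE)
target = "Search_Fields_profile_docno_input"

# Any element with the ID versus only <input> elements with the ID.
any_element = root.findall(f".//*[@id='{target}']")
input_only = root.findall(f".//input[@id='{target}']")

# With a unique ID, both queries return the very same single element.
same_element = (
    len(any_element) == 1
    and len(input_only) == 1
    and any_element[0] is input_only[0]
)
print(same_element)
```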
There are some good answers to your "why?" question here, but for Selenium use, there's an even better alternative. Since your page element has an ID attribute, use Selenium's ID locator instead of XPath or CSS:
Selenium.Click("id=Search_Fields_profile_docno_input");
This will go directly to the element, and will run quicker than just about any other locator. Note that the syntax is id=value, not id="value".
Given any element in your document, there's an infinite number of XPath expressions that will select it uniquely. Therefore it's entirely reasonable for two different products to generate two different paths.
Google has just released Wicked Good XPath, a rewrite of Cybozu Lab's famous JavaScript-XPath. Link: https://code.google.com/p/wicked-good-xpath/ The rewritten version is 40% smaller and about 30% faster than the original implementation.
You can check this out and replace the one being used in Selenium.

How to use the "translate" Xpath function on a node-set

I have an XML document that contains items with dashes I'd like to strip
e.g.
<xmlDoc>
<items>
<item>a-b-c</item>
<item>c-d-e</item>
</items>
</xmlDoc>
I know I can find-replace a single item using this xpath
/xmlDoc/items/item[1]/translate(text(),'-','')
Which will return
"abc"
however, how do I do this for the entire set?
This doesn't work
/xmlDoc/items/item/translate(text(),'-','')
Nor this
translate(/xmlDoc/items/item/text(),'-','')
Is there a way at all to achieve that?
This cannot be done with a single XPath 1.0 expression.
Use the following XPath 2.0 expression to produce a sequence of strings, each being the result of the application of the translate() function on the string value of the corresponding node:
/xmlDoc/items/item/translate(.,'-', '')
The translate function accepts a string as input, not a node-set. This means that writing something like:
"translate(/xmlDoc/items/item/text(),'-','')"
or
"translate(/xmlDoc/items/item,'-','')"
will result in a function call on the first node only (item[1]).
In XPath 1.0 I think you have no choice other than doing something ugly like:
"concat(translate(/xmlDoc/items/item,'-',''),
translate(/xmlDoc/items/item[2],'-',''))"
This is impractical for a long list of items and returns just a single string.
In XPath 2.0 this can be solved nicely using for expressions:
"for $item in /xmlDoc/items/item
return replace($item,'-','')"
Which returns a sequence type:
abc cde
PS Do not confuse function calls with location paths. They are different kinds of expressions, and in XPath 1.0 they cannot be mixed.
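When only XPath 1.0 is available, the per-item replacement can also be done in the host language after selecting the nodes. A minimal Python sketch using the standard library and the sample document from the question (with the items element properly closed):

```python
import xml.etree.ElementTree as ET

DOC = """
<xmlDoc>
  <items>
    <item>a-b-c</item>
    <item>c-d-e</item>
  </items>
</xmlDoc>
"""

root = ET.fromstring(DOC)
# Equivalent of /xmlDoc/items/item: root is <xmlDoc>, so the path is relative to it.
items = root.findall("./items/item")
# translate(., '-', '') deletes every '-'; str.replace does the same for one character.
stripped = [(item.text or "").replace("-", "") for item in items]
print(stripped)
```

Unlike the concat(...) workaround, this yields one string per item rather than a single joined string.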
Here is yet another example, run against the Chrome developer tools in preparation for a Selenium test.
$x("//table[@id='sometable_table']//tr[1=1 and ./td[2=2 and position()=2 and .//*[translate(text(), ',', '') ='1001'] ] ]/td[position()=2]")
Essentially, the table sometable_table has a column containing numbers that appear localized. For example, 1001 would appear as 1,001. With the above you have a somewhat nasty XPath expression.
First you select all table rows. Then you focus on the table data cell at position 2 of each row. Then you go deeper into the contents of that cell until you find any node whose text, after string replacement, is 1001. Finally you ask for the table data cell at position 2 to be returned.
But since all your main filters are at the table row level, you could be doing additional filters at table data columns at other positions as well, if you need to find the appropriate table row that has content (A) on a cell column and content (B) on a different column.
NOTE:
It was actually quite nasty to write this, because intuitively we all google for "XPath replace string". So I was getting frustrated trying to use the XPath replace function until I realized Chrome supports only XPath 1.0. In XPath 1.0 the available string functions are different from those in XPath 2.0, so you need to use the translate function.
See reference:
http://www.edankert.com/xpathfunctions.html
