I am trying to write an XPath query using the XPath functions lower-case() or upper-case(), but they do not seem to work in Selenium (where I test my XPath before I apply it).
Example that does NOT work:
//*[.=upper-case('some text')]
I have no problem locating the nodes I need in complex paths, even using aggregate functions, as long as I don't use upper-case() or lower-case().
Has anyone encountered this before? Does it make sense?
Thanks.
upper-case() and lower-case() are XPath 2.0 functions. Chances are your platform supports XPath 1.0 only.
Try:
translate('some text','abcdefghijklmnopqrstuvwxyz','ABCDEFGHIJKLMNOPQRSTUVWXYZ')
which is the XPath 1.0 way to do it. Unfortunately, this requires knowledge of the alphabet the text uses. For plain English, the above probably works, but if you expect accented characters, make sure you add them to the list.
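Extending the two alphabets by hand is error-prone, so here is a minimal Python sketch that generates the translate() call from a list of extra lower-case letters. The helper name translate_to_upper and the sample letters are made up for illustration. It assumes each extra letter has a single-character upper-case form and skips those that do not (for example 'ß', whose upper-case form is 'SS').

```python
def translate_to_upper(expr, extra_lower=""):
    """Build an XPath 1.0 translate() call that upper-cases `expr`."""
    lower = "abcdefghijklmnopqrstuvwxyz"
    # Keep only extra letters whose upper-case form is a single, different
    # character; translate() maps one character to one character.
    for ch in extra_lower:
        if ch not in lower and len(ch.upper()) == 1 and ch.upper() != ch:
            lower += ch
    upper = lower.upper()
    return f"translate({expr}, '{lower}', '{upper}')"


# Hypothetical usage: upper-case the context node, including two accented letters.
expr = translate_to_upper(".", "éüß")
print(expr)
```

The resulting string can be dropped into a predicate such as //*[... = 'SOME TEXT'].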
In most environments you are using XPath out of a host language of some sort, and can use the host language's capabilities to work around this XPath 1.0 limitation by externally providing upper- and lower-case variants of the search string to translate().
For example, in Python:
search = 'Some Text'
lc = search.lower()
uc = search.upper()
xpath = f"//p[contains(translate(., '{lc}', '{uc}'), '{uc}')]"
This would produce the following XPath expression:
//p[contains(translate(., 'some text', 'SOME TEXT'), 'SOME TEXT')]
which searches case-insensitively and works for any search text that does not itself contain a single quote (a quote would terminate the XPath string literal).
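To see why passing only the letters of the search string is enough, here is a small Python sketch that mimics what XPath 1.0 translate() does. The functions xpath_translate and contains_ci are illustrative stand-ins, not part of any library, and the sketch assumes lower() and upper() of the search string have the same length.

```python
def xpath_translate(s, src, dst):
    """Mimic XPath 1.0 translate(): map src[i] to dst[i]."""
    out = []
    for ch in s:
        i = src.find(ch)          # first occurrence wins, as in XPath
        if i == -1:
            out.append(ch)        # not listed in src: keep unchanged
        elif i < len(dst):
            out.append(dst[i])    # listed: replace with its counterpart
        # listed in src but beyond the end of dst: the character is dropped
    return "".join(out)


def contains_ci(haystack, search):
    """Same idea as contains(translate(., lc, uc), uc) in the XPath above."""
    lc, uc = search.lower(), search.upper()
    return uc in xpath_translate(haystack, lc, uc)


hit = contains_ci("Here is SoMe TeXt.", "Some Text")
miss = contains_ci("Nothing relevant", "Some Text")
print(hit, miss)
```

Only the letters that occur in the search string get upper-cased in the haystack, which is all the comparison needs.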
If you are going to need upper case in multiple places in your XSLT, you can define variables for the lower-case and upper-case alphabets and then use them in your translate() calls everywhere. It should make your XSLT much cleaner.
Example at XSL/XPATH : No upper-case function in MSXML 4.0 ?
I have an element with three occurrences on the page. If I match it with the XPath expression //div[@class='col-md-9 col-xs-12'], I get all three occurrences as expected.
Now I try to rework the matching element on the fly with
substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen'), to get the string before the word "Bewertungen",
normalize-space(//div[@class='col-md-9 col-xs-12']), to clean up redundant whitespace,
normalize-space(substring-before(//div[@class='col-md-9 col-xs-12'] - both actions.
The problem with the last three expressions is that they extract only the first occurrence of the element. It makes no difference whether I add /text() after the matching expression.
I don't understand how adding normalize-space() and/or substring-before() changes the "main" expression so that it stops recognizing multiple occurrences of the targeted element and returns only the first. Without the addition it matches everything as it should.
How can I adjust XPath expression no. 3 to get all occurrences of the element?
Example url is https://www.provenexpert.com/de-de/jazzyshirt/
The problem is that both normalize-space() and substring-before() have a required cardinality of 1, meaning they can only accept one occurrence of the element you are trying to normalize or take a substring of. Each of your expressions passes them a sequence of 3 nodes, which these two functions cannot process. (I probably didn't express the problem properly, but I think this is the general idea).
In light of that, try the following (it needs XPath 2.0 or later, because it uses a function call as a path step):
//div[@class='col-md-9 col-xs-12']/substring-before(normalize-space(.), 'Bewertung')
Note that in XPath 1.0, functions like substring-after(), if given a set of three nodes as input, ignore all nodes except the first. XPath 2.0 changes this: it gives you an error.
In XPath 3.1 you can apply a function to each of the nodes using the simple map operator, "!": //div[condition] ! substring-before(normalize-space(), 'Bewertung'). That returns a sequence of 3 strings. There's no equivalent in XPath 1.0, because there's no data type in XPath 1.0 that can represent a sequence of strings.
In XPath 2.0 you can often achieve the same effect using "/" instead of "!", but it has restrictions.
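If only XPath 1.0 is available, one workaround is to select the nodes with XPath and apply the string functions per node in the host language. The following is a minimal Python sketch using the standard library's ElementTree. The markup in DOC is invented for illustration and assumed to be well-formed XML, and the two helpers only approximate the XPath functions of the same name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the real page.
DOC = """
<html><body>
  <div class="col-md-9 col-xs-12">  4.8   of 5   120 Bewertungen </div>
  <div class="col-md-9 col-xs-12">3.9 of 5
      17 Bewertungen</div>
  <div class="col-md-9 col-xs-12">no reviews yet</div>
</body></html>
"""


def substring_before(text, marker):
    """Like XPath substring-before(): empty string when marker is absent."""
    before, sep, _ = text.partition(marker)
    return before if sep else ""


def normalize_space(text):
    """Roughly like XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


root = ET.fromstring(DOC)
# The node selection stays in XPath; it returns all three elements.
divs = root.findall(".//div[@class='col-md-9 col-xs-12']")
# The string functions are then applied to each node separately.
results = [
    substring_before(normalize_space("".join(d.itertext())), "Bewertung")
    for d in divs
]
print(results)
```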
When asking questions on StackOverflow, please always mention which version of XPath you are using. We tend to assume that if people don't say, they're probably using 1.0, because 1.0 products don't generally advertise their version number.
I was just wondering if there is a shorter way of writing an XPath query to find all HREF values containing at least one of many search values?
What I currently have is the following:
//a[contains(@href, 'value1') or contains(@href, 'value2')]
But it seems quite ugly, especially if I were to have more values.
First of all, in many cases you have to live with the "ugliness" or long-windedness of expressions if only XPath 1.0 is at your disposal. Elegance is something introduced with version 2.0, I daresay.
But there might be ways to improve your expression: Is there a regularity to the href attributes you'd like to find? For instance, if it is sufficient as a rule to say that the href attribute values must start with "value", then the expression could be
//a[starts-with(@href,'value')]
I know that "value1" and "value2" are most probably not your actual attribute values but there might be something else that uniquely identifies the group of a elements you're after. Post your HTML input if this is something you want us to help you with.
Personally, I do not find your expression ugly. There is just one or operator and the expression is quite short and readable. I take
if I were to have more values.
to mean that currently, there are only two attribute values you are interested in and that your question therefore is a theoretical one.
In case you're using XPath 2.0 and want exact matches rather than substring matches, you can shorten this to
//a[@href = ('value1', 'value2')]
For contains() this syntax wouldn't work, as the second argument of contains() may only be zero or one value.
In XPath 2.0 you could also use
//a[some $s in ('value1', 'value2') satisfies contains(@href, $s)]
or
//a[matches(@href, "value1|value2")]
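If you are stuck with XPath 1.0 and the list of values grows, the long or-chain can at least be generated in the host language rather than typed by hand. A minimal Python sketch follows. The helper name any_contains is made up, and it assumes the values contain no single quotes.

```python
def any_contains(attr, values):
    """Return an XPath 1.0 predicate body that ORs one contains() per value."""
    return " or ".join(f"contains(@{attr}, '{v}')" for v in values)


xpath = f"//a[{any_contains('href', ['value1', 'value2', 'value3'])}]"
print(xpath)
```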
This question regards XPath expressions.
I want to find the average of the length of all URLs in a Web page, that point to a .pdf file.
So far I have constructed the following expression, but it does not work:
sum(string-length(string(//a/@href[contains(., ".pdf")]))) div
count(//a/@href[contains(., ".pdf")])
Any help will be appreciated!
You will need XPath 2.0.
For calculating the sum of the string lengths, you will need to either
build a concatenated string of all @href values to pass to string-length($string as xs:string) (which only allows a single string as parameter), but concat(...) only takes an arbitrary number of atomic strings, not a sequence of them; or
apply string-length(...) to every @href as @Navin Rawat proposed, but using arbitrary functions in axis steps is a new feature of XPath 2.0.
If using XPath 2.0, there are functions avg(...) and ends-with(...) which help you in stripping down the expression to
avg(//a/@href[ends-with(., '.pdf')]/string-length())
If you have to stick with XPath 1.0, all you can do is use the expression below to fetch the URLs and calculate the average outside XPath.
Anyway, the subexpression you proposed will fail at URLs like http://example.net/myfile.pdf.txt. Only compare the end of the URL:
//a[@href[substring(., string-length(.) - 3) = '.pdf']]/@href
And you missed a path step for the attribute, so you've been trying to average the string length of the link names right now.
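As a rough illustration of calculating the average outside XPath, here is a Python sketch using the standard library's ElementTree. The sample document is invented and assumed to be well-formed XML; real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the third link must not count as a PDF.
DOC = """
<html><body>
  <a href="http://example.net/a.pdf">A</a>
  <a href="http://example.net/report.pdf">B</a>
  <a href="http://example.net/myfile.pdf.txt">C</a>
  <a name="anchor-without-href">D</a>
</body></html>
"""

root = ET.fromstring(DOC)
# Keep only href values that end with ".pdf", mirroring the substring() test.
hrefs = [a.get("href") for a in root.iter("a") if a.get("href", "").endswith(".pdf")]
# Guard against division by zero when no PDF links are present.
average = sum(len(h) for h in hrefs) / len(hrefs) if hrefs else 0.0
print(hrefs, average)
```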
Please put something like:
sum(//a/@href[contains(.,'.pdf')]/string-length()) div count(//a/@href[contains(.,'.pdf')])
I am new to using XPath and I am trying to retrieve a node via its attribute, but the problem is that the attribute value is case-insensitive, meaning I won't know exactly how the string is cased in the document.
So for example:
Given the document:
<Document xmlns:my="http://www.MyDomain.com/MySchemaInstance">
  <Machines>
    <Machine FQDN="machine1.mydomain.com">
      <...>
    </Machine>
    <Machine FQDN="Machine2.MyDomain.Com">
      <...>
    </Machine>
  </Machines>
</Document>
If I want to retrieve the machine1 I would use the XPath:
//my:Machines/my:Machine/*[@FQDN='machine1.mydomain.com']
But a similar XPath to get machine2 would fail because the case does not match:
//my:Machines/my:Machine/*[@FQDN='machine2.mydomain.com'] //Fails
I have seen various posts mention using something like (I am not sure how to apply Namespaces to this):
translate(@FQDN, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')
But even if I got it to work it would be really cumbersome considering the number of times I would be using it.
Finally I have read that XPath 2.0 supports matches() and lower-case() but being new to XPath I don't understand how to apply them:
For example if I try the following I get an "Invalid Qualified name":
//my:Machines/my:Machine/[matches(@FQDN, '(?i)machine1.mydomain.com')]
//my:Machines/my:Machine/[lower-case(@FQDN, 'machine1.mydomain.com')]
Can someone provide a sample XPath that includes handling of Namespaces that would work?
Thanks
Your example XML and XPath statements don't match.
The sample XML elements are not bound to a namespace. The "my" namespace-prefix is declared, but not used for those elements, so they are in the "no namespace".
Your sample XPath is using predicate filters on the children of Machine rather than on the Machine element that has the @FQDN.
You could use any of these methods to match the value case-insensitively:
matches() function, with a flag for case-insensitive matching:
//Machines/Machine[matches(@FQDN,'machine2.mydomain.com','i')]
upper-case() function to evaluate the upper-case strings:
//Machines/Machine[upper-case(@FQDN)=upper-case('machine2.mydomain.com')]
lower-case() function to evaluate the lower-case strings:
//Machines/Machine[lower-case(@FQDN)=lower-case('machine2.mydomain.com')]
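One caveat about the matches() variant: its second argument is a regular expression, so an unescaped "." matches any character, and the pattern matches if it is found anywhere in the value. In XPath you could anchor and escape it, for example '^machine2\.mydomain\.com$'. The Python sketch below shows the same effect with the re module, which behaves similarly on these points. The function fqdn_matches and the sample strings are made up for illustration.

```python
import re


def fqdn_matches(value, fqdn):
    """Case-insensitive, whole-string, literal comparison via a regex."""
    return re.fullmatch(re.escape(fqdn), value, flags=re.IGNORECASE) is not None


# Unanchored and unescaped: the first '.' matches '-', and extra text around
# the name is ignored, so this unrelated value still matches.
loose = re.search("machine2.mydomain.com", "xmachine2-mydomain.com.evil", re.IGNORECASE)

strict_ok = fqdn_matches("Machine2.MyDomain.Com", "machine2.mydomain.com")
strict_bad = fqdn_matches("xmachine2-mydomain.com.evil", "machine2.mydomain.com")
print(loose is not None, strict_ok, strict_bad)
```

The upper-case() and lower-case() comparisons do not have this problem, because they compare whole strings literally.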
Can someone provide a sample XPath that includes handling of
Namespaces that would work?
Not sure what you meant by the handling of namespaces, but if you wanted to match on those elements regardless of their namespace then you can use the wildcard operator for the namespace:
//*:Machines/*:Machine[matches(@FQDN,'machine2.mydomain.com','i')]
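If no XPath 2.0 processor is at hand, the same lookup can be sketched in a host language by selecting the elements with a namespace wildcard and comparing the attribute case-insensitively there. The Python example below uses the standard library's ElementTree, whose {*} wildcard (assumed here: Python 3.8 or later) plays the role of *: above. The function find_machine is a made-up helper, and the <...> placeholders from the sample document are left out.

```python
import xml.etree.ElementTree as ET

DOC = """
<Document xmlns:my="http://www.MyDomain.com/MySchemaInstance">
  <Machines>
    <Machine FQDN="machine1.mydomain.com"/>
    <Machine FQDN="Machine2.MyDomain.Com"/>
  </Machines>
</Document>
"""


def find_machine(root, fqdn):
    """Return the first Machine whose FQDN equals fqdn, ignoring case."""
    wanted = fqdn.lower()
    # {*} matches the element name in any namespace, or in none.
    for machine in root.findall(".//{*}Machines/{*}Machine"):
        if machine.get("FQDN", "").lower() == wanted:
            return machine
    return None


root = ET.fromstring(DOC)
found = find_machine(root, "machine2.mydomain.com")
print(found.get("FQDN") if found is not None else None)
```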
I'm using nokogiri to select the 'keywords' attribute like this:
puts page.parser.xpath("//meta[@name='keywords']").to_html
One of the pages I'm working with has the keywords label with a capital "K" which has motivated me to make the query case insensitive.
<meta name="keywords"> AND <meta name="Keywords">
So, my question is: What is the best way to make a nokogiri selection case insensitive?
EDIT Tomalak's suggestion below works great for this specific problem. I'd like to also use this example to help understand nokogiri better though and have a couple issues that I'm wondering about and have not been successful searching for. For example, are the regex 'pseudo classes' Nokogiri Docs appropriate for a problem like this?
I'm also curious about the matches?() method in nokogiri. I have not been able to find any clarification on the method. Does it have anything to do with the 'matches' concept in XPath 2.0 (and therefore could it be used to solve this problem)?
Thanks very much.
Nokogiri allows custom XPath functions. The nokogiri docs that you link to show an inline class definition for when you're only using it once. If you have a lot of custom functions or if you use the case-insensitive match a lot, you may want to define it in a class.
class XpathFunctions
  def case_insensitive_equals(node_set, str_to_match)
    node_set.find_all {|node| node.to_s.downcase == str_to_match.to_s.downcase }
  end
end
Then call it like any other XPath function, passing in an instance of your class as the 2nd argument.
page.parser.xpath("//meta[case_insensitive_equals(@name,'keywords')]",
  XpathFunctions.new).to_html
In your Ruby method, node_set will be bound to a Nokogiri::XML::NodeSet. In the case where you're passing in an attribute value like @name, it will be a NodeSet with a single Nokogiri::XML::Attr. So calling to_s on it gives you its value. (Alternatively, you could use node.value.)
Unlike using XPath translate where you have to specify every character, this works on all the characters and character encodings that Ruby works on.
Also, if you're interested in doing other things besides case-insensitive matching that XPath 1.0 doesn't support, it's just Ruby at this point. So this is a good starting point.
Wrapped for legibility:
puts page.parser.xpath("
  //meta[
    translate(
      @name,
      'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
      'abcdefghijklmnopqrstuvwxyz'
    ) = 'keywords'
  ]
").to_html
There is no "to lower case" function in XPath 1.0, so you have to use translate() for this kind of thing. Add accented letters as necessary.