Is this XPath technique reliable in all situations?

I am developing an application that accepts user-defined XPath expressions and employs them as part of its runtime operation.
However, I would like to be able to infer some additional data by programmatically manipulating the expression, and I am curious to know whether there are any situations in which this approach might fail.
Given any user-defined XPath expression that returns a node set, is it safe to wrap it in the XPath count() function to determine the number of nodes in the set:
count(user_defined_expression)
Similarly, is it safe to append an array index to the expression to extract one of the nodes in the set:
user_defined_expression[1]

Well, an XPath 1.0 expression can yield a node-set, a string, a number, or a boolean, and count(expression) only makes sense for an expression that yields a node-set.
As for adding a positional predicate, I think you should put parentheses around the expression, i.e. change /root/foo/bar into (/root/foo/bar)[1]. That way you select the first bar element in the node-set selected by /root/foo/bar. Without the parentheses you would get /root/foo/bar[1], which selects the first bar child element of every foo child element of the root element.
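The difference can be sketched in Python. This is only an illustration under stated assumptions: the helper names wrap_count and wrap_first are hypothetical, the sample document is made up, and the standard-library ElementTree module supports only a small XPath subset (no count(), no parenthesized expressions), so the parenthesized form is emulated by taking the first node of the complete result list.

```python
import xml.etree.ElementTree as ET


def wrap_count(expr: str) -> str:
    """Wrap a node-set expression in count() (hypothetical helper)."""
    return f"count({expr})"


def wrap_first(expr: str) -> str:
    """Parenthesize first, so [1] applies to the whole node-set (hypothetical helper)."""
    return f"({expr})[1]"


# Made-up sample document: two foo elements holding three bar elements in total.
doc = ET.fromstring(
    "<root>"
    "<foo><bar>a</bar><bar>b</bar></foo>"
    "<foo><bar>c</bar></foo>"
    "</root>"
)

# Equivalent of /root/foo/bar[1]: the first bar child of *each* foo.
per_parent = [e.text for e in doc.findall("./foo/bar[1]")]

# Emulation of (/root/foo/bar)[1]: the first node of the whole result,
# taken in document order.
all_bars = doc.findall("./foo/bar")
overall_first = [e.text for e in all_bars[:1]]

print(wrap_count("/root/foo/bar"), wrap_first("/root/foo/bar"))
print(per_parent, overall_first)
```

The parentheses also matter when the user supplies a union such as a | b: appending [1] without them would attach the predicate to b only.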

Are you checking that such user-defined expressions always evaluate to a node-set?
If so, the first expression is fine: the argument will have the data type that fn:count expects.
The second one is a lot trickier. In many situations the predicate binds more tightly than the axis step, for example, so it applies to the last step rather than to the whole expression. Check this answer for a simple analysis. It will be difficult to say what the user really meant.

A more robust approach would be to convert the XPath expression to XQueryX, which is an XML representation of the abstract syntax tree; you can then do XQuery or XSLT transformations on this XML representation, and then convert back to a modified XPath (or XQuery) for evaluation.
However, this will still only give you the syntactic structure of the expression; if you want semantic information, such as the inferred static type of the result, you will probably have to poke inside an XPath processor that exposes this information.

Related

How to get multiple occurrences of an element with XPath when using normalize-space and substring-before

I have an element with three occurrences on the page. If I match it with the XPath expression //div[@class='col-md-9 col-xs-12'], I get all three occurrences as expected.
Now I try to rework the matching element on the fly with
substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen'), to get the string before the word "Bewertungen",
normalize-space(//div[@class='col-md-9 col-xs-12']), to clean up redundant whitespace,
normalize-space(substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen')), to do both.
The problem with the last three expressions is that they extract only the first occurrence of the element. It makes no difference whether I add /text() after the matching expression.
I don't understand how adding normalize-space and/or substring-before influences the "main" expression so that it stops recognizing multiple occurrences of the targeted element and returns only the first. Without the addition it matches everything as it should.
How can I adjust XPath expression no. 3 to get all occurrences of the element?
Example url is https://www.provenexpert.com/de-de/jazzyshirt/
The problem is that both normalize-space() and substring-before() have a required cardinality of 1, meaning they can only accept one occurrence of the element you are trying to normalize or take a substring of. Each of your expressions passes a sequence of three nodes, which these two functions cannot process. (I probably didn't express the problem properly, but I think this is the general idea.)
In light of that, try:
//div[@class='col-md-9 col-xs-12']/substring-before(normalize-space(.), 'Bewertung')
Note that in XPath 1.0, functions like substring-after(), if given a set of three nodes as input, ignore all nodes except the first. XPath 2.0 changes this: it gives you an error.
In XPath 3.1 you can apply a function to each of the nodes using the simple map operator, "!": //div[condition] ! substring-before(normalize-space(), 'Bewertung'). That returns a sequence of 3 strings. There's no equivalent in XPath 1.0, because there's no data type in XPath 1.0 that can represent a sequence of strings.
In XPath 2.0 you can often achieve the same effect using "/" instead of "!", but it has restrictions.
When asking questions on StackOverflow, please always mention which version of XPath you are using. We tend to assume that if people don't say, they're probably using 1.0, because 1.0 products don't generally advertise their version number.
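With an engine limited to XPath 1.0, one workaround is to select the nodes with XPath and do the string processing per node in the host language. The following is a minimal sketch, assuming Python's standard-library ElementTree; the sample markup and the helper names normalize_space and substring_before are made up for illustration and only approximate the XPath functions of the same name.

```python
import re
import xml.etree.ElementTree as ET


def normalize_space(s: str) -> str:
    """Collapse runs of XML whitespace and trim both ends, like normalize-space()."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def substring_before(s: str, marker: str) -> str:
    """Return the part of s before marker, or '' if marker is absent, like substring-before()."""
    if marker == "":
        return ""
    head, sep, _ = s.partition(marker)
    return head if sep else ""


# Made-up stand-in for the page; the real markup will differ.
page = ET.fromstring(
    "<html><body>"
    "<div class='col-md-9 col-xs-12'>  Shop A \n 12 Bewertungen </div>"
    "<div class='col-md-9 col-xs-12'>Shop B\t3 Bewertungen</div>"
    "<div class='col-md-9 col-xs-12'>No reviews yet</div>"
    "</body></html>"
)

# Select all matching nodes first, then process each string value separately.
results = [
    substring_before(normalize_space("".join(div.itertext())), "Bewertung")
    for div in page.findall(".//div[@class='col-md-9 col-xs-12']")
]
print(results)
```

Each div yields its own string, which is the per-node behaviour that a single XPath 1.0 function call cannot provide.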

Why did the definition of dot (.) change between XPath 1.0 and 2.0?

When researching details for an answer to an XPath question here on Stack Overflow, I ran into a difference between XPath 1.0 and 2.0 that I can find no rationale for.
I tried to understand what . really means.
In XPath 1.0, . is an abbreviation for self::node(). Both self and node are crystal-clear to me.
In XPath 2.0, . is a primary expression, the "context item expression". The Abbreviated Syntax section states this explicitly in a note.
What was the rationale for the change? Is there a difference between . and self::node() in XPath 2.0?
From the spec itself, the intent of the change is not clear to me. I tried googling keywords like dot or period, primary expression, and rationale.
XPath 1.0 had four data types: string, number, boolean, and node-set. There was no way of handling collections of values other than nodes. This meant, for example, that there was no way of summing over derived values (if elements had attributes of the form price='$23.95', there was no way of summing over the numbers obtained by stripping off the $ sign because the result of such stripping would be a set of numbers, and there was no such data type).
So XPath 2.0 introduced more general sequences, and that meant that the facilities for manipulating sequences had to be generalised; for example if $X is a sequence of numbers, then $X[. > 0] filters the sequence to include only the positive numbers. But that only works if "." can refer to a number as well as to a node.
In short: self::node() filters out atomic items, while . does not. Atomic items (numbers, strings, and many other XML Schema types) are not nodes (unlike elements, attributes, comments, etc.).
Consider the example from the spec: (1 to 100)[. mod 5 eq 0]. If the . is replaced by self::node(), the expression is not valid XPath, because mod requires both arguments to be numeric and atomization does not help in this case.
For those scanning the spec: XPath 2.0 defines the item() type-matching construct, but it has nothing to do with node tests, as atomic values are not nodes and axis steps always return just nodes. Therefore, dot cannot be defined as self::item(). It really needs to be a special language construct.

Is there a short and elegant way to write an XPath 1.0 expression to get all HREF values containing at least one of many search values?

I was just wondering if there is a shorter way of writing an XPath query to find all HREF values containing at least one of many search values?
What I currently have is the following:
//a[contains(@href, 'value1') or contains(@href, 'value2')]
But it seems quite ugly, especially if I were to have more values.
First of all, in many cases you have to live with the "ugliness" or long-windedness of expressions if only XPath 1.0 is at your disposal. Elegance is something introduced with version 2.0, I daresay.
But there might be ways to improve your expression: Is there a regularity to the href attributes you'd like to find? For instance, if it is sufficient as a rule to say that the said href attribute values must start with "value", then the expression could be
//a[starts-with(@href,'value')]
I know that "value1" and "value2" are most probably not your actual attribute values but there might be something else that uniquely identifies the group of a elements you're after. Post your HTML input if this is something you want us to help you with.
Personally, I do not find your expression ugly. There is just one or operator and the expression is quite short and readable. I take "if I were to have more values" to mean that currently, there are only two attribute values you are interested in and that your question therefore is a theoretical one.
In case you're using XPath 2 and would like exact matches rather than matches that merely contain a search value, you can shorten it to
//a[@href = ('value1', 'value2')]
For contains() this syntax wouldn't work, as the second argument of contains() must be a single value or the empty sequence.
In XPath 2 you could also use
//a[some $s in ('value1', 'value2') satisfies contains(@href, $s)]
or
//a[matches(@href, "value1|value2")]
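If only XPath 1.0 is available and the list of search values is long, the or-chain can be generated rather than written by hand. The following is a sketch in Python; href_contains_any and xpath_literal are hypothetical helper names, and the quoting helper is there because XPath 1.0 string literals have no escape mechanism.

```python
def xpath_literal(value: str) -> str:
    """Quote a string as an XPath 1.0 literal (hypothetical helper)."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # The value contains both quote kinds: stitch the pieces together with concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def href_contains_any(values: list[str]) -> str:
    """Build //a[contains(@href, v1) or contains(@href, v2) ...] (hypothetical helper)."""
    if not values:
        raise ValueError("at least one search value is required")
    tests = " or ".join(f"contains(@href, {xpath_literal(v)})" for v in values)
    return f"//a[{tests}]"


print(href_contains_any(["value1", "value2", "value3"]))
```

The generated expression is as long as the hand-written one, but the length no longer has to be maintained manually.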

XPath 'or' behaving like union ('|') with libxml2

I have XML documents like:
<rootelement>
<myelement>test1</myelement>
<myelement>test2</myelement>
<myelement type='specific'>test3</myelement>
</rootelement>
I'd like to retrieve the specific myelement, and if it's not present, then the first one. So I write:
/rootelement/myelement[@type='specific' or position()=1]
The XPath spec states about the 'or expression' that:
"The right operand is not evaluated if the left operand evaluates to true"
The problem is that libxml2-2.6.26 seems to apply the union of both expressions, returning a "2 Node Set" (for example using xmllint --shell).
Is it libxml2, or am I doing something wrong?
Short answer: your selector doesn't express what you think it does.
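The or operator is a boolean operator that is evaluated separately for each candidate node, so the nodes it selects are effectively the union of the nodes matching either condition.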
The part of the spec you quoted ("The right operand is not evaluated...") is part of standard boolean logic short circuiting.
Here's why you get a 2-node set for your example input: XPath looks at every myelement that's a child of rootelement, and applies the [@type='specific' or position()=1] part to each such node to determine whether or not it matches the selector.
<myelement>test1</myelement> does not match @type='specific', but it does match position()=1, so it matches the whole selector.
<myelement>test2</myelement> does not match @type='specific', and it also does not match position()=1, so it does not match the whole selector.
<myelement type='specific'>test3</myelement> matches @type='specific' (so XPath does not have to test its position; that's the short-circuiting part), so it matches the whole selector.
The first and last <myelement>s match the whole selector, so it returns a 2-node set.
The easiest way to select elements the way you want to is to do it in two steps. Here's the pseudocode (I don't know what context you're actually using XPath in, and I'm not that familiar with writing XPath-syntax selectors):
Select elements that match /rootelement/myelement[@type='specific']
If elements is empty, select elements that match /rootelement/myelement[position()=1]
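The two steps above could look like this in Python. It is a minimal sketch, assuming the standard-library ElementTree module as the XPath host; pick_myelement is a hypothetical helper name, and the second document is a made-up variant without a 'specific' element.

```python
import xml.etree.ElementTree as ET


def pick_myelement(root: ET.Element) -> list[ET.Element]:
    """Return the 'specific' myelement(s), or the first myelement as a fallback."""
    specific = root.findall("./myelement[@type='specific']")
    if specific:
        return specific
    # Fallback: positional predicate, first myelement child only.
    return root.findall("./myelement[1]")


with_specific = ET.fromstring(
    "<rootelement>"
    "<myelement>test1</myelement>"
    "<myelement>test2</myelement>"
    "<myelement type='specific'>test3</myelement>"
    "</rootelement>"
)
without_specific = ET.fromstring(
    "<rootelement>"
    "<myelement>test1</myelement>"
    "<myelement>test2</myelement>"
    "</rootelement>"
)

print([e.text for e in pick_myelement(with_specific)])
print([e.text for e in pick_myelement(without_specific)])
```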
@Matt Ball explained very well the cause of your problem.
Here is an XPath one-liner selecting exactly what you want:
/*/myelement[@type='specific'] | /*[not(myelement[@type='specific'])]/myelement[1]

How do I construct an XPath expression to select items that do not contain a string

How do I use something similar to the example below, but with the opposite result, i.e. items that do not contain (default) in the text?
<test>
<item>Some text (default)</item>
<item>Some more text</item>
<item>Even more text</item>
</test>
Given this
//test/item[contains(text(), '(default)')]
would return the first item. Is there a not operator that I can use with contains?
Yes, there is:
//test/item[not(contains(text(), '(default)'))]
Hint: not() is a function in XPath rather than an operator.
An alternative, possibly better way to express this is:
//test/item[not(text()[contains(., '(default)')])]
There is a subtle but important difference between the two expressions (let's call them A and B, respectively).
Simple case: If all <item> only have a single text node child, both A and B behave the same.
Complex case: If <item> can have multiple text node children, expression A only matches when '(default)' occurs in the first of them.
This is because text() matches all text node children and produces a node-set. So far no surprise. Now, contains() accepts a node-set as its first argument, but it needs to convert it to a string to do its job. And conversion from node-set to string only produces the string value of the first node in the set; all other nodes are disregarded (try string(//item) to see what I mean). In the simple case this is exactly what happens as well, but the result is not as surprising.
Expression B deals with this by explicitly checking every text node individually instead of only checking the string value of the first one. It's therefore the more robust of the two.
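The difference between A and B can be reproduced outside an XPath engine. The sketch below emulates the XPath 1.0 conversion rules in plain Python on top of ElementTree, which has no contains() of its own; the helper names and the mixed-content sample item are made up for illustration.

```python
import xml.etree.ElementTree as ET


def child_text_nodes(elem: ET.Element) -> list[str]:
    """Text node children of elem, in document order (what text() selects)."""
    nodes = []
    if elem.text:
        nodes.append(elem.text)
    for child in elem:
        if child.tail:
            nodes.append(child.tail)
    return nodes


def matches_a(item: ET.Element, needle: str) -> bool:
    """Emulates not(contains(text(), needle)): only the first text node counts."""
    nodes = child_text_nodes(item)
    first = nodes[0] if nodes else ""  # an empty node-set converts to ""
    return needle not in first


def matches_b(item: ET.Element, needle: str) -> bool:
    """Emulates not(text()[contains(., needle)]): every text node is checked."""
    return not any(needle in t for t in child_text_nodes(item))


# Made-up input: the first item has two text node children around <b>.
doc = ET.fromstring(
    "<test>"
    "<item>Some <b>bold</b> text (default)</item>"
    "<item>Some more text</item>"
    "</test>"
)

items = doc.findall("./item")
a_hits = ["".join(i.itertext()) for i in items if matches_a(i, "(default)")]
b_hits = ["".join(i.itertext()) for i in items if matches_b(i, "(default)")]
print(a_hits)
print(b_hits)
```

In this sample, A still selects the first item because '(default)' sits in its second text node, while B excludes it.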

Resources