This question is specifically about using XPath in XSLT 2.0 and Saxon.
XPaths ending with [1]
For XPaths like
following-sibling::foo[1]
descendant::bar[1]
I take it for granted that Saxon will not iterate over the entire axis but stop when it finds the first matching node - crucial in situations like:
following-sibling::foo[some:expensivePredicate(.)][1]
I assume that this is also the case for XPaths like this:
(following-sibling::foo/descendant::bar)[1]
I.e. Saxon will not compile the entire set of nodes matching following-sibling::foo/descendant::bar before picking the first one in the set. Rather, it will (even for chained axes) stop at the first matching node.
XPaths ending with [last()]
Now it gets interesting. When going "backwards" in the tree, I assume that XPaths like
preceding-sibling::foo[1]
work just as efficiently as their following-sibling equivalents. But what happens when chaining axes, e.g.
(preceding-sibling::foo/descendant::bar)[last()]
As we need to use [last()] here instead of [1],
will Saxon compile the entire set of nodes to count them to get a numeric value for last()?
Or will it be smart and stop iterating the preceding-sibling axis when it found a matching descendant?
Or will it be even more clever and iterate the descendant axis in reverse to more efficiently find the last descendant?
Saxon has a variety of strategies for evaluating last(). When used as a predicate, meaning [position()=last()], it is generally translated to an internal function [isLast()] which can be evaluated by a single-item lookahead. (So in your example of (preceding-sibling::foo/descendant::bar)[last()], it doesn't build the node set in memory; rather, it reads the nodes one by one and, when it hits the end, returns the last one it found.)
In other cases, particularly when used in XSLT match patterns, Saxon will convert child::x[last()] to child::x[not(following-sibling::x)].
When none of these approaches work, for many years Saxon had two strategies for evaluating last() depending on the expression it was applied to: (a) sometimes it would evaluate the expression twice, counting nodes the first time and returning them the second time; (b) in other cases it would read all the nodes into memory. We've recently encountered cases where strategy (a) fails: see https://saxonica.plan.io/issues/3122, and so we're always doing (b).
The last() expression is potentially expensive and it should be avoided where possible. For example the classic "insert a separator between adjacent items" which is often written
xx
if (position() != last()) sep
is much better written as
if (position() != 1) sep
xx
i.e. instead of inserting the separator after every item except the last, insert it before every item except the first. Or use string-join, or xsl:value-of/@separator.
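The three alternatives mentioned above can be sketched in XSLT 2.0 roughly as follows. The input shape (an items element with item children) and the ", " separator are assumptions made purely for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input: <items><item>a</item><item>b</item><item>c</item></items> -->
  <xsl:template match="/items">
    <!-- Separator before every item except the first: no call to last() -->
    <xsl:for-each select="item">
      <xsl:if test="position() != 1">, </xsl:if>
      <xsl:value-of select="."/>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>

    <!-- Same result using the separator attribute of xsl:value-of -->
    <xsl:value-of select="item" separator=", "/>
    <xsl:text>&#10;</xsl:text>

    <!-- Same result using string-join() -->
    <xsl:value-of select="string-join(item, ', ')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

For the assumed input, each of the three variants should produce the line a, b, c.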
XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of these don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,
empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
Similarly, exists(s) is equivalent to boolean(s) that already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
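A small sketch of these equivalences, and of the effective-boolean-value caveat from the correction above. The sample values and the initial template name main are assumptions chosen for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- empty() and not() disagree on the singleton (0): false, true -->
    <xsl:value-of select="empty((0)), not((0))" separator=", "/>
    <xsl:text>&#10;</xsl:text>

    <!-- exists() and boolean() disagree on it too: true, false -->
    <xsl:value-of select="exists((0)), boolean((0))" separator=", "/>
    <xsl:text>&#10;</xsl:text>

    <!-- "some" versus a filter in boolean context: true, true -->
    <xsl:value-of select="(some $x in (1, 2, 3) satisfies $x gt 2),
                          boolean((1, 2, 3)[. gt 2])" separator=", "/>
    <xsl:text>&#10;</xsl:text>

    <!-- "every" versus a doubly negated filter: true, true -->
    <xsl:value-of select="(every $x in (1, 2, 3) satisfies $x gt 0),
                          not((1, 2, 3)[not(. gt 0)])" separator=", "/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```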
These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.
Given all that ... what would be a clear and readable way to write an XPath test expression that asks
Does value X occur in sequence S?
Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)
1. X = S: This would be one of the most unreadable, since it requires the reader to
think about which of X and S are sequences vs. single values
understand general comparisons, which are not obvious from the syntax
However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
2. index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
3. exists(index-of(S, X)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
4. some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
5. S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
6. exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
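To make the differences concrete, here is a sketch that evaluates the six forms side by side, with S = (0, 1) and X = 0 chosen deliberately to expose the falsy-value problem. The variable names and the initial template name main are assumptions for illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <xsl:template name="main">
    <xsl:variable name="S" as="xs:integer*" select="(0, 1)"/>
    <xsl:variable name="X" as="xs:integer" select="0"/>

    <!-- The six forms in order, each forced to a boolean.
         Form 5 yields false because $S[. eq $X] is the singleton (0). -->
    <xsl:value-of separator=", " select="
        ($X = $S),
        boolean(index-of($S, $X)),
        exists(index-of($S, $X)),
        (some $m in $S satisfies $m eq $X),
        boolean($S[. eq $X]),
        exists($S[. eq $X])"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Not evaluated here: boolean(index-of((1, 1), 1)) raises err:FORG0006,
         because index-of() returns the two-item sequence (1, 2). -->
  </xsl:template>
</xsl:stylesheet>
```

With these values the expected output is true, true, true, true, false, true.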
What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?
Apologies for the long question. Thanks for reading this far.
Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.
For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:
<xsl:when test="@type = ('A', 'A+', 'A-', 'B+')" />
or
<xsl:when test="@type = $magic-types"/>
If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.
I prefer this one:
count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))
When $x is itself a sequence, this expression implements the (value-based) "subset of" relation between two sets of values that are represented as sequences. This implementation of "subset of" has just linear time complexity -- vs. many other ways of expressing this, which have O(N^2) time complexity.
To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
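As a sketch, the expression can be wrapped in a stylesheet function. The function name my:subset-of, its namespace URI, and the sample values are hypothetical and only serve the illustration.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:my"
    exclude-result-prefixes="xs my">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: true if every value in $x also occurs in $seq.
       Prepending $x to $seq adds no new distinct values exactly in that case. -->
  <xsl:function name="my:subset-of" as="xs:boolean">
    <xsl:param name="x" as="xs:anyAtomicType*"/>
    <xsl:param name="seq" as="xs:anyAtomicType*"/>
    <xsl:sequence select="count(distinct-values($seq))
                          eq count(distinct-values(($x, $seq)))"/>
  </xsl:function>

  <xsl:template name="main">
    <xsl:variable name="names" select="('alice', 'bob', 'carol')"/>
    <xsl:value-of separator=", " select="
        my:subset-of('bob', $names),
        my:subset-of('dave', $names),
        my:subset-of(('bob', 'alice'), $names)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

For these values the result should be true, false, true.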
The functx library has a nice implementation of this function, so you can use
functx:is-node-in-sequence($X, $Y)
(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)
The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)
Marklogic ships the functx library with their core product; other vendors may also.
Another possibility, when you want to know whether node X occurs in sequence S, is
exists((X) intersect S)
I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask
exists(('bob') intersect ('alice', 'bob'))
you'll get a runtime error.
In the program I'm working on now, I need to compare strings, so this isn't an option.
As Dimitri notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.
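A minimal sketch of the identity-based test, assuming an input of item elements where some carry keep="yes" (the element and attribute names are made up for the example). The quantified form with the is operator is shown alongside for comparison.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input:
       <items><item id="a" keep="yes"/><item id="b"/></items> -->
  <xsl:template match="/">
    <!-- S: the node sequence we test membership in -->
    <xsl:variable name="S" select="//item[@keep = 'yes']"/>
    <xsl:for-each select="//item">
      <!-- Both tests compare node identity, not node value -->
      <xsl:value-of select="@id,
                            exists(. intersect $S),
                            (some $n in $S satisfies $n is .)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For the assumed input this should print a true true and then b false false.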
This is a sample from an XML document.
<A>
<Value>B2.B1-1.C2-0.D20</Value>
</A>
<A>
<Value>A2.B15-1.C2-0.D20</Value>
</A>
<A>
<Value>A2.B2-1.C2-0.D20</Value>
</A>
and so on.
I need to sort this to look like
A2.B2-1.C2-0.D20
A2.B15-1.C2-0.D20
B2.B1-1.C2-0.D20
The number of dot-separated components is not known, and the numbers in them can be in any format (1-1, 11, 11abcd). The sorting should be intuitive, as one would normally expect: it is based first on the letters, and runs of digits are bunched together and read as numbers (B2 then B15 is the correct order; the lexical order B15, B2 is not correct).
Can this be done with XSLT 1.0 ?
XSLT doesn't define precisely how sorting should operate; the results are implementation-defined.
In recent releases of Saxon there is a collation that does what you want, but that assumes XSLT 2.0; in fact it assumes Saxon.
Doing it in a portable way in XSLT 1.0 is not easy, especially as you can't call out to recursive templates to compute the sort key.
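For the Saxon-specific route, a sketch might look like the following. It assumes the alphanumeric collation URI as described in the Saxon documentation (check the documentation for your release), and it assumes the A elements sit directly under a single root element. It is not portable to other processors, and how it orders components such as 11abcd has not been verified here.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumption: the A elements are children of the document element -->
  <xsl:template match="/*">
    <xsl:for-each select="A">
      <!-- Saxon-only collation: runs of digits are compared by numeric value -->
      <xsl:sort select="Value"
                collation="http://saxon.sf.net/collation?alphanumeric=yes"/>
      <xsl:value-of select="Value"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```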
If I have a context node, an XPath expression and a node, is there a way to check if my node satisfies the XPath expression in that context.
I have XPath queries that are very expensive and long to run. Here I would simply like to take a potential result node and check if it satisfies the query, i.e. it would be returned as part of the query result set.
I am using Saxon EE 9.3
If your context node is $N, your expression is EXP, and the node being tested is $T, then the expression boolean($N/(EXP) intersect $T) does what you are asking for in the first part of your question. However, it may not meet the requirement implied by the second part of the question, which is that the computation should be faster than evaluating EXP.
If the expression EXP takes the form of an XSLT pattern then the answer is yes, there is a way and it is likely to be faster (though how to achieve this depends on what Saxon APIs you are using). Note that when EXP is a pattern, the question of whether $T matches the pattern does not depend on knowing a context node $N.
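Both approaches can be sketched inside a stylesheet. The input document, the choice of book[@year > 2000] as EXP, and the mode name in-exp are assumptions for illustration. Using a template rule is just one way to test a node against a pattern from within XSLT; it is not necessarily the route you would take through the Saxon APIs.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed input:
       <library>
         <book year="1995"/>
         <book year="2005"/>
       </library> -->
  <xsl:template match="/">
    <xsl:variable name="N" select="library"/>
    <xsl:variable name="T" select="library/book[2]"/>

    <!-- Expression form: evaluates EXP from $N, then checks $T by identity -->
    <xsl:value-of select="boolean($N/(book[@year &gt; 2000]) intersect $T)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Pattern form: only $T is tested against the match pattern -->
    <xsl:apply-templates select="$T" mode="in-exp"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template match="book[@year &gt; 2000]" mode="in-exp">true</xsl:template>
  <xsl:template match="node()" mode="in-exp">false</xsl:template>
</xsl:stylesheet>
```

For the assumed input both lines should read true, since the second book has year 2005.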
When you search in Google "100F to C" how does it know to convert from Fahrenheit to Celsius? The same goes for conversion between currencies and for simple calculations.
What is the data structure used, or is it simple pattern matching the strings?
It's not exactly simple pattern matching. Evaluating the mathematical expressions you can enter is not trivial. For example, here's an algorithm that evaluates a math expression. That's just the evaluation, there's probably a lot of code to detect if it's even valid.
For the currencies conversion and other units, that's simple pattern matching.
it's simple pattern matching
try
100 kmh in mph = no calculation
100 kph in mph = 62.1371192 mph