XSLT Custom Sorting using XSLT 1.0

This is a sample from an XML document.
<A>
<Value>B2.B1-1.C2-0.D20</Value>
</A>
<A>
<Value>A2.B15-1.C2-0.D20</Value>
</A>
<A>
<Value>A2.B2-1.C2-0.D20</Value>
</A>
and so on.
I need to sort this to look like
A2.B2-1.C2-0.D20
A2.B15-1.C2-0.D20
B2.B1-1.C2-0.D20
The number of dot-separated components is not known in advance, and the numbers within them can be in any format (1-1, 11, 11abcd). The sorting is the intuitive one a person would normally expect: letters are compared first, and runs of digits are bunched together and read as numbers (B2 before B15 is the correct order; the lexical order B15, B2 is not correct).
Can this be done with XSLT 1.0?

XSLT doesn't define precisely how sorting should operate; the results are implementation-defined.
In recent releases of Saxon there is a collation that does what you want, but that assumes XSLT 2.0; in fact it assumes Saxon.
Doing it in a portable way in XSLT 1.0 is not easy, especially as you can't call a recursive template from inside the select expression of xsl:sort to compute the sort key.
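One common workaround is a two-pass approach: first compute a zero-padded sort key for each Value with a recursive named template, then sort on the precomputed key. The following is a minimal sketch only; it assumes the EXSLT exsl:node-set() extension is available and that no digit run is longer than four digits.
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="exsl">

  <xsl:template match="/">
    <!-- Pass 1: wrap each A in an item carrying its computed sort key -->
    <xsl:variable name="keyed">
      <xsl:for-each select="//A">
        <item>
          <key>
            <xsl:call-template name="pad-numbers">
              <xsl:with-param name="s" select="string(Value)"/>
            </xsl:call-template>
          </key>
          <xsl:copy-of select="."/>
        </item>
      </xsl:for-each>
    </xsl:variable>
    <!-- Pass 2: sort on the precomputed key -->
    <xsl:for-each select="exsl:node-set($keyed)/item">
      <xsl:sort select="key"/>
      <xsl:copy-of select="A"/>
    </xsl:for-each>
  </xsl:template>

  <!-- Zero-pad every run of digits to width 4 so that a plain text sort
       orders them numerically (B2 -> B0002 sorts before B15 -> B0015) -->
  <xsl:template name="pad-numbers">
    <xsl:param name="s"/>
    <xsl:param name="digits" select="''"/>
    <xsl:variable name="c" select="substring($s, 1, 1)"/>
    <xsl:choose>
      <xsl:when test="$s = ''">
        <xsl:if test="$digits != ''">
          <xsl:value-of select="concat(substring('0000', string-length($digits) + 1), $digits)"/>
        </xsl:if>
      </xsl:when>
      <xsl:when test="translate($c, '0123456789', '') = ''">
        <!-- digit: extend the current run -->
        <xsl:call-template name="pad-numbers">
          <xsl:with-param name="s" select="substring($s, 2)"/>
          <xsl:with-param name="digits" select="concat($digits, $c)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <!-- non-digit: flush any pending run, padded, then the character -->
        <xsl:if test="$digits != ''">
          <xsl:value-of select="concat(substring('0000', string-length($digits) + 1), $digits)"/>
        </xsl:if>
        <xsl:value-of select="$c"/>
        <xsl:call-template name="pad-numbers">
          <xsl:with-param name="s" select="substring($s, 2)"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>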

Related

How can Clojure data-structures best be tagged with a type?

I'm writing a program that's manipulating polynomials. I'm defining polynomials recursively as either a term (base case) or a sum or product of polynomials (recursive cases).
Sums and products are completely identical as far as their contents are concerned. They just contain a sequence of polynomials. But they need to be processed very differently. So to distinguish them I have to somehow tag my sequences of polynomials.
Currently I have two records defined, Sum and Product. But this is causing my code to be littered with calls like (:polynomials sum-or-product) to extract the contents of polynomials. Also, printing out even small polynomials in the REPL produces so much boilerplate that I have to run everything through a dedicated pretty-printing routine if I want to make sense of it.
Alternatives I have considered are tagging my sums and products using metadata instead, or putting a + or * symbol at the head of the sequence. But I'm not convinced that either of these approaches is good style, and I'm wondering if there's perhaps another option I haven't considered yet.
Putting a + or * symbol at the head of the sequence sounds like it would print out nicely. I would try implementing the processing of these two different "types" via multimethods, which keeps the calling convention neat and extensible. The multimethods documentation starts from an object-oriented programming view, but the "area of a shape" example is a very neat illustration of what this approach can accomplish.
In your case you'd use the first element of the seq to determine whether you are dealing with a sum or a product of polynomials, and the multimethod would automagically use the correct implementation.
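For example, a minimal sketch assuming the polynomials are plain seqs tagged with '+ or '* at the head (evaluate is a made-up operation, just to show the dispatch):
(defmulti evaluate
  ;; dispatch on the tag at the head; plain terms dispatch to :default
  (fn [p] (when (seq? p) (first p))))

(defmethod evaluate '+ [[_ & polys]]
  (reduce + (map evaluate polys)))

(defmethod evaluate '* [[_ & polys]]
  (reduce * (map evaluate polys)))

;; base case: anything that isn't a tagged seq is a plain term
(defmethod evaluate :default [term]
  term)

;; (evaluate '(+ 1 (* 2 3))) ;=> 7
Adding a new operation then only requires new defmethods, not changes to the data representation.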

Chained XPath axes with trailing `[1]` or `[last()]` predicates

This question is specifically about using XPath in XSLT 2.0 and Saxon.
XPaths ending with [1]
For XPaths like
following-sibling::foo[1]
descendant::bar[1]
I take it for granted that Saxon will not iterate over the entire axis but stop when it finds the first matching node - crucial in situations like:
following-sibling::foo[some:expensivePredicate(.)][1]
I assume that this is also the case for XPaths like this:
(following-sibling::foo/descendant::bar)[1]
I.e. Saxon will not build the entire set of nodes matching following-sibling::foo/descendant::bar before picking the first one from the set. Rather, it will (even for chained axes) stop at the first matching node.
XPaths ending with [last()]
Now it gets interesting. When going "backwards" in the tree, I assume that XPaths like
preceding-sibling::foo[1]
work just as efficiently as their following-sibling equivalents. But what happens when chaining axes, e.g.
(preceding-sibling::foo/descendant::bar)[last()]
As we need to use [last()] here instead of [1],
will Saxon build the entire node-set just to count it and obtain a numeric value for last()?
Or will it be smart and stop iterating the preceding-sibling axis once it has found a matching descendant?
Or will it be even more clever and iterate the descendant axis in reverse to find the last descendant more efficiently?
Saxon has a variety of strategies for evaluating last(). When used as a predicate, meaning [position()=last()], it is generally translated to an internal function [isLast()] which can be evaluated with a single-item lookahead. (So in your example of (preceding-sibling::foo/descendant::bar)[last()], it doesn't build the node-set in memory; rather, it reads the nodes one by one and, when it hits the end, returns the last one it found.)
In other cases, particularly when used in XSLT match patterns, Saxon will convert child::x[last()] to child::x[not(following-sibling::x)].
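For illustration, with x standing in for any element name, that rewrite replaces a positional check with a simple existence test:
<!-- positional form: needs to know how many x siblings there are -->
<xsl:template match="x[last()]">...</xsl:template>

<!-- rewritten form: just checks that nothing follows -->
<xsl:template match="x[not(following-sibling::x)]">...</xsl:template>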
When none of these approaches works, for many years Saxon had two strategies for evaluating last() depending on the expression it was applied to: (a) sometimes it would evaluate the expression twice, counting nodes the first time and returning them the second time; (b) in other cases it would read all the nodes into memory. We've recently encountered cases where strategy (a) fails, see https://saxonica.plan.io/issues/3122, and so we now always use (b).
The last() expression is potentially expensive, and it should be avoided where possible. For example, the classic "insert a separator between adjacent items" is often written
xx
if (position() != last()) sep
is much better written as
if (position() != 1) sep
xx
i.e. instead of inserting the separator after every item except the last, insert it before every item except the first. Or use string-join(), or the separator attribute of xsl:value-of.
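In stylesheet form, a minimal sketch (item is a placeholder element name):
<!-- XSLT 1.0: separator before every item except the first; no last() -->
<xsl:for-each select="item">
  <xsl:if test="position() != 1">
    <xsl:text>, </xsl:text>
  </xsl:if>
  <xsl:value-of select="."/>
</xsl:for-each>

<!-- XSLT 2.0: the same result via the separator attribute -->
<xsl:value-of select="item" separator=", "/>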

Average numbers from impure nodes using pure xpath 1.0

Is it possible to average the temperatures in the following XML:
<days>
<day><temperature>40 F</temperature></day>
<day><temperature>45 F</temperature></day>
<day><temperature>50 F</temperature></day>
</days>
In XPath 2.0, I could get the average of the three numbers using
avg(//days/day/temperature/number(translate(.,' F','')))
Is it possible to write an expression in pure XPath 1.0 that can do the same thing? This response to a question regarding the use of sum() on impure nodes in XPath 1.0 makes me think that maybe it's not.
So, to summarize: is there any way to get the average temperature from these impure nodes using only an XPath 1.0 expression?
sum(//days/day/temperature/number(translate(.,' F',''))) div count(//days/day/temperature)
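Note that a function call used as a path step, as in .../number(translate(...)), is XPath 2.0 syntax; a processor that is strict about XPath 1.0 grammar will reject it. If you are inside a stylesheet rather than limited to a single expression, the usual XSLT 1.0 fallback is a recursive named template. A minimal sketch:
<xsl:template name="sum-temps">
  <xsl:param name="nodes"/>
  <xsl:param name="total" select="0"/>
  <xsl:choose>
    <xsl:when test="not($nodes)">
      <xsl:value-of select="$total"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- strip the " F" suffix of the first node, add it, recurse on the rest -->
      <xsl:call-template name="sum-temps">
        <xsl:with-param name="nodes" select="$nodes[position() &gt; 1]"/>
        <xsl:with-param name="total"
            select="$total + number(translate($nodes[1], ' F', ''))"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<!-- average = recursive sum div count -->
<xsl:variable name="sum">
  <xsl:call-template name="sum-temps">
    <xsl:with-param name="nodes" select="//days/day/temperature"/>
  </xsl:call-template>
</xsl:variable>
<xsl:value-of select="$sum div count(//days/day/temperature)"/>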

XSLT Performance

I am working for a project which has many XSLT transformations.
The transformations have to be as fast as possible.
For readability I wrote many of them separating the "business logic" from the "output". For example:
<!-- Business Logic -->
<xsl:variable name="myLocalVar">
<xsl:value-of select="func:whateverComputation(params)" />
</xsl:variable>
<!-- more business logic here -->
<!-- Output -->
<xsl:element name="mytag">
<xsl:value-of select="$myLocalVar" />
</xsl:element>
Of course this can be written in a compact form
<xsl:element name="mytag">
<xsl:value-of select="func:whateverComputation(params)" />
</xsl:element>
Is the first form slower than the second one?
From a section of the XSLT FAQ:
A few points related to XSLT performance:
xsl:variable values are computed dynamically. These variables are not cached, and are evaluated every time they are referenced in the XSL. Explicit type casting of an xsl:variable improves performance; you can do type casting with the string() and boolean() functions.
For example:
<xsl:variable name="_attr" select="string( /node/child[ #attr ] )">
Instead of using sub-elements, use attributes wherever possible. Using attributes instead of elements improves the performance. When performing XPath matches, attributes are faster because they are loosely typed. This makes validation of the schema easier.
When you match against attribute values, use enumerator attributes. Use multiple attribute names as bits, and set their values to true or false.
Eight tips for how to use XSLT efficiently:
Keep the source documents small. If necessary split the document first.
Keep the XSLT processor (and Java VM) loaded in memory between runs
If you use the same stylesheet repeatedly, compile it first.
If you use the same source document repeatedly, keep it in memory.
If you perform the same transformation repeatedly, don't. Store the result instead.
Keep the output document small. For example, if you're generating HTML, use CSS.
Never validate the same source document more than once.
Split complex transformations into several stages.
Eight tips for how to write efficient XSLT:
Avoid repeated use of "//item".
Don't evaluate the same node-set more than once; save it in a variable.
Avoid <xsl:number> if you can. For example, by using position().
Use <xsl:key>, for example to solve grouping problems (see the sketch after this list).
Avoid complex patterns in template rules. Instead, use <xsl:choose> within the rule.
Be careful when using the preceding[-sibling] or following[-sibling] axes. This often indicates an algorithm with n-squared performance.
Don't sort the same node-set more than once. If necessary, save it as a result tree fragment and access it using the node-set() extension function.
To output the text value of a simple #PCDATA element, use <xsl:value-of> in preference to <xsl:apply-templates>.
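On the <xsl:key> point above, a minimal Muenchian-grouping sketch (person elements with a city attribute are a hypothetical input structure):
<xsl:key name="by-city" match="person" use="@city"/>

<xsl:template match="/">
  <!-- visit only the first person of each distinct city -->
  <xsl:for-each select="//person[generate-id() =
                                 generate-id(key('by-city', @city)[1])]">
    <city name="{@city}">
      <xsl:copy-of select="key('by-city', @city)"/>
    </city>
  </xsl:for-each>
</xsl:template>
Each key() lookup is roughly constant-time, which is what makes this far cheaper than comparing every node against every other.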
Saving the result of a function application to a variable isn't going to have any significant impact on performance in the general case (and some XSLT processors, such as Saxon, use lazy evaluation, so the function will not be evaluated until the variable is actually needed).
Conversely, if the function would otherwise be evaluated more than once with the same parameters, saving the result in a variable can in some cases significantly increase efficiency.
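For instance, reusing the question's hypothetical func:whateverComputation(), binding the value once with a select attribute (which also avoids building a result tree fragment) lets every later reference reuse it:
<!-- evaluated at most once, however many times $result is referenced -->
<xsl:variable name="result" select="func:whateverComputation(params)"/>
<first><xsl:value-of select="$result"/></first>
<second><xsl:value-of select="$result"/></second>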
The correct way to improve performance is:
Profile/measure to identify real bottlenecks.
Optimize only the biggest bottlenecks.
If there is still need for increased performance, start a new iteration, going to 1. above.
To quote Donald Knuth: "Premature optimization is the root of all evil" -- which is actually a paraphrase of the well-known saying: "The road to hell is paved with good intentions."
A little late to the game, but I thought I'd share this link: Techniques to Improve Performance of XSL Transformations.

Deducing string transformation rules

I have a set of pairs of character strings, e.g.:
abba - aba,
haha - aha,
baa - ba,
exb - esp,
xa - za
The second (right) string in the pair is somewhat similar to the first (left) string.
That is, a character from the first string can be represented by nothing, itself or a character from a small set of characters.
There's no simple rule for this character-to-character mapping, although there are some patterns.
Given several thousands of such string pairs, how do I deduce the transformation rules such that if I apply them to the left strings, I get the right strings?
The solution can be approximate, working correctly for, say, 80-95% of the strings.
Would you recommend using some kind of genetic algorithm? If so, how?
If you could align the characters, or rather groups of characters, you could work out tables saying that aa => a, bb => z, and so on. If you had such tables, you could align the characters using dynamic time warping (http://en.wikipedia.org/wiki/Dynamic_time_warping). One approach is therefore to guess an alignment (e.g. one-for-one as a starting point, or just align the first and last characters of each sequence), work out a translation table from that, use DTW to get a new alignment, work out a revised translation table, and iterate in that way. Perhaps you could wrap this up with enough maths to show that there is some measure of optimality or probability that such passes increase, climbing to a local maximum.
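A rough sketch of that iterate-align-recount loop, in Python with made-up names, using an edit-distance style alignment rather than DTW proper:
from collections import Counter

def align(src, dst, cost):
    """Needleman-Wunsch-style alignment; returns (s, d) pairs where
    either side may be '' for an insertion/deletion."""
    n, m = len(src), len(dst)
    # dp[i][j] = cheapest cost to align src[:i] with dst[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + cost(src[i - 1], '')
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + cost('', dst[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + cost(src[i - 1], dst[j - 1]),
                           dp[i - 1][j] + cost(src[i - 1], ''),
                           dp[i][j - 1] + cost('', dst[j - 1]))
    # trace back to recover the character-to-character mapping
    pairs, i, j = [], n, m
    while i or j:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + cost(src[i - 1], dst[j - 1]):
            pairs.append((src[i - 1], dst[j - 1])); i, j = i - 1, j - 1
        elif i and dp[i][j] == dp[i - 1][j] + cost(src[i - 1], ''):
            pairs.append((src[i - 1], '')); i -= 1
        else:
            pairs.append(('', dst[j - 1])); j -= 1
    return pairs[::-1]

def derive_rules(string_pairs, rounds=5):
    """Alternate between aligning with the current costs and re-deriving
    costs from the observed mappings (cheap = frequently seen)."""
    cost = lambda s, d: 0.0 if s == d else 1.0   # initial guess
    for _ in range(rounds):
        counts = Counter(p for src, dst in string_pairs
                           for p in align(src, dst, cost))
        total = sum(counts.values())
        cost = lambda s, d, c=counts, t=total: 1.0 - c[(s, d)] / t
    return counts

print(derive_rules([("abba", "aba"), ("haha", "aha"), ("baa", "ba")]))
The returned Counter is the raw translation table; thresholding it by frequency gives the approximate rules the question asks for.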
There is probably some way of doing this by modelling a Hidden Markov Model that generates both sequences simultaneously and then deriving rules from that model, but I would not choose this approach unless I was already familiar with HMMs and had software to use as a starting point that I was happy to modify.
You could use text-to-speech to create sound waves, then compare the sound waves with each other and match them by percentage.
This is my theory of how Google has such an advanced spell checker.
