XPath performance & versions

I have 3 questions:
1) Is XPath string "//table[position()=8 or position()=10]/td[1]/span[2]/text()" faster than the XPath string "//table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()"?
I use XPath with .NET (C#) and the HtmlAgilityPack.
2) How can I determine which version of XPath I'm using? If I'm on XPath 1.0, how do I upgrade to XPath 2.0?
3) Does XPath 2.0 bring performance optimizations and improvements, or just new features and new syntax?

XPath 2.0 expands significantly on XPath 1.0 (read here for a summary), though you don't need to switch unless you would benefit from the new functionality.
As for which one would be faster, I believe the first, because the second repeats the node search. The first is also more readable, and in general you want to go with the more readable one anyway.

As to the performance question, I'm afraid I don't know. It depends on the optimizer in the particular XPath processor you are using. If it's important to you, measure it. If it's not important enough to measure, then it's not important enough to worry about.
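If it matters in your case, it is easy enough to measure with the HtmlAgilityPack setup you already have. A rough sketch (the file name and iteration count are placeholders, and the results will vary with your document and HtmlAgilityPack version):

using System;
using System.Diagnostics;
using HtmlAgilityPack;

class XPathTiming
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("page.html"); // placeholder: load the page you actually scrape

        Time(doc, "//table[position()=8 or position()=10]/td[1]/span[2]/text()");
        Time(doc, "//table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()");
    }

    static void Time(HtmlDocument doc, string xpath)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000; i++)
        {
            // SelectNodes returns null when nothing matches; only the timing matters here
            doc.DocumentNode.SelectNodes(xpath);
        }
        sw.Stop();
        Console.WriteLine($"{xpath}\n  {sw.ElapsedMilliseconds} ms for 10,000 evaluations");
    }
}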
As I mentioned in my previous reply, //table[8] smells wrong to me. I think it's much more likely that you want (//table)[8]. Both are valid XPath expressions, but they produce different answers: //table[8] selects every table that is the eighth table child of its parent, whereas (//table)[8] selects the eighth table in the entire document.
You can probably assume that a processor is XPath 1.0 unless it says otherwise - if it supports 2.0, they'll want you to know. But you can easily test, for example by seeing what happens when you do //a except //b.
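For the .NET setup in the question, one quick way to run that kind of test is to hand a 2.0-only expression to the engine and see whether it is rejected. A sketch (System.Xml.XPath, which HtmlAgilityPack builds on, is an XPath 1.0 engine, so the compile should throw):

using System;
using System.Xml.XPath;

class XPathVersionProbe
{
    static void Main()
    {
        try
        {
            // "except" is XPath 2.0 syntax; a 1.0 parser cannot compile it
            XPathExpression.Compile("//a except //b");
            Console.WriteLine("Accepted: the engine understands XPath 2.0 syntax.");
        }
        catch (XPathException)
        {
            Console.WriteLine("Rejected: this is an XPath 1.0 engine.");
        }
    }
}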
There's no intrinsic reason why an XPath 2.0 processor should be faster than a 1.0 processor on the same queries. In fact, it might be a bit slower, because it's required to do more careful type-checking. On the other hand it might be a lot faster, because many 1.0 processors were dashed off very quickly and never upgraded. But there are massive improvements in functionality in 2.0, for example regular expression support.

Related

What’s the difference between array[i] and array.item(i) in ExtendScript?

I’ve seen both in code samples; for instance, in Adobe InDesign CS6 JavaScript Scripting Guide:
app.documents.item(0).pages.item(0)
myDoc.pages[0]
Are they interchangeable? Which one is best practice?
There's not really an interesting answer here: yes, they are interchangeable, and which one you choose is up to you. I did a quick performance test, and the bracket operator seems to be slightly faster, but only by a factor of about 1.1, so it should not make much of a difference.
The only difference between the two (that does not apply to your scenario) is that item() can also be used to address an item by name, as in myDoc.paragraphStyles('headline'); and is therefore in turn interchangeable with itemByName().

How can I optimize an XPath expression?

Is there any way I can shorten the following condition used in an XPath expression?
(../parent::td or ../parent::ol or ../parent::ul)
The version of XPath is 1.0.
The shortest is probably
../..[self::td|self::ol|self::ul]
Whether there is a performance difference between "|" and "or" will depend on the processor, but I suspect that in most cases it won't be noticeable. For performance, the important thing is to put the conditions in the right order (the one most likely to return true should come first). Factoring out the navigation to the grandparent should almost certainly help performance, with the caveats (a) your XPath engine may do this optimization automatically, and (b) the difference will probably be so tiny you will have trouble measuring it.
use the '|' operator.
(../parent::td|../parent::ol|../parent::ul)
Slightly shorter:
../..[self::td or self::ol or self::ul]
Example usage:
//p[../..[self::td or self::ol or self::ul]]

Replacement for descendant-or-self

I have an XPath expression, $x/descendant-or-self::*/@y, which I have changed to $x//@y as it improved the performance.
Does this change have any other impact?
As explained in the W3C XPath Recommendation, // is shorthand for /descendant-or-self::node()/, so $x//@y actually expands to $x/descendant-or-self::node()/@y; the node() test rather than * is the slight difference. But since attributes can only occur on elements, I think this replacement is safe.
That might also explain why you see a performance boost, since MarkLogic needs to worry less about whether there really are elements in between.
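The question is about MarkLogic, but the equivalence holds for any conformant XPath engine. As a quick illustration, here is the same check against a toy document with .NET's XPathNavigator (the document and names are purely illustrative):

using System;
using System.IO;
using System.Xml.XPath;

class DescendantOrSelfCheck
{
    static void Main()
    {
        var doc = new XPathDocument(new StringReader(
            "<x y='1'><a y='2'><b y='3'/></a><c/></x>"));
        XPathNavigator x = doc.CreateNavigator();
        x.MoveToFirstChild(); // position on <x>, standing in for $x

        // .//@y expands to ./descendant-or-self::node()/@y
        int longForm = x.Select("descendant-or-self::*/@y").Count;
        int shortForm = x.Select(".//@y").Count;

        // Both select the same three y attributes, because only elements carry attributes
        Console.WriteLine($"{longForm} vs {shortForm}");
    }
}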
HTH!

Performance of XPath vs DOM

Would anyone enlighten me with a comprehensive performance comparison between XPath and DOM in different scenarios? I've read some questions on SO like "xPath vs DOM API, which one has a better performance" and "XPath or querySelector?". None of them mentions specific cases. Here are some things I could start with.
1) No iteration involved: getElementById('foobar') vs //*[@id='foobar']. Is the former consistently faster than the latter? What if the latter is optimized, e.g. /html/body/div[@id='foo']/div[@id='foobar']?
2) Iteration involved: getElementByX, then traverse through the child nodes, vs XPath: generate a snapshot, then traverse through the snapshot items.
3) Axis involved: getElementByX, then traverse for next siblings, vs //following-sibling::foobar.
4) Different implementations: different browsers and libraries implement XPath and DOM differently. Which browser's implementation of XPath is better?
As the answer in "xPath vs DOM API, which one has a better performance" says, an average programmer may screw up when implementing complicated tasks (e.g. with multiple axes involved) the DOM way, while XPath is guaranteed to be optimized. Therefore, my question only concerns the simple selections that can be done both ways.
Thanks for any comment.
XPath and DOM are both specifications, not implementations. You can't ask questions about the performance of a spec, only about specific implementations. There's at least a ten-to-one difference between a fast XPath engine and a slow one: and they may be optimized for different things, e.g. some spend a lot of time optimizing a query on the assumption it will be executed multiple times, which might be the wrong thing to do for single-shot execution. The one thing one can say is that the performance of XPath depends more on the engine you are using, and the performance of DOM depends more on the competence of the application programmer, because it's a lower-level interface. Of course all programmers consider themselves to be much better than average...
This page has a section where you can run tests to compare the two and see the results in different browsers. For instance, for Chrome, xpath is 100% slower than getElementById.
See getElementById vs QuerySelector for more information.
I agree with Michael that it may depend on the implementation, but I would generally say that DOM is faster. The reason is that I don't see any way you can optimize the parsed document to make XPath faster.
If you're traversing HTML and not XML, a specialized parser can index all the ids and classes in the document. This makes getElementById and getElementsByClassName much faster.
With XPath, there's only one way to find an element with that id: by traversing, either top down or bottom up. You may be able to memoize repeated queries (or partial queries), but I don't see any other optimization that can be done.
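The question is about browsers, but the same trade-off is easy to sketch in .NET with HtmlAgilityPack (the library from the first question above). A rough, unscientific comparison; the file name and id are placeholders, and the numbers depend entirely on your document and library version:

using System;
using System.Diagnostics;
using HtmlAgilityPack;

class IdLookupVsXPath
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("page.html"); // placeholder

        const int runs = 100000;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
            doc.GetElementbyId("foobar"); // DOM-style id lookup
        Console.WriteLine($"GetElementbyId: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < runs; i++)
            doc.DocumentNode.SelectSingleNode("//*[@id='foobar']"); // XPath lookup
        Console.WriteLine($"XPath: {sw.ElapsedMilliseconds} ms");
    }
}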

Looking for better performance in XSLT: xsl:element name="div" vs. <div>

Which performs better in XSLT when writing out an XHTML element:
<xsl:element name="div">
<xsl:attribute name="class">someclass</xsl:attribute>
</xsl:element>
or just write it out
<div class="someclass"></div>
Does it make any difference in processing speed / performance, etc.?
I suspected that XSLT compilers probably convert one into the other internally and, sure enough, at least some of them do:
"Literal result elements now compile internally into xsl:element and xsl:attribute instructions. This results in changes to trace output: each attribute is now traced as a separate instruction."
More generally, this smells like the kind of micro-optimization that's unlikely to render an improvement that outweighs the benefits of choosing the more readable version.
Just about any XSL transformer will map both variants to the same internal representation. I just tested a million calls with saxonb-xslt 9: absolutely no difference.
Readability is important only in the development cycle. High-load sites that use dynamic presentation with XSL will want to shave off seemingly impossibly small amounts of time.
To test which is faster in your processor, create two XSL stylesheets that repeat the same thing 10,000 times and then benchmark the processing speed on the front end. Then divide the time difference by 10,000 and you get your true difference in speed.
The size of the XSL will also have an impact. A fully designed page written with xsl:element will be much larger than straight HTML. However, if you use XSL that way, you should consider breaking the specific data down into XML/XSL includes and doing the rest of the page templating somewhere else.
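To make that measuring suggestion concrete on .NET, a minimal harness with XslCompiledTransform might look like this (the stylesheet and input file names are hypothetical; the stylesheet is compiled outside the timed loop so only the transform itself is measured):

using System;
using System.Diagnostics;
using System.IO;
using System.Xml.Xsl;

class XsltTiming
{
    static void Main()
    {
        // Hypothetical stylesheets: one built with xsl:element/xsl:attribute,
        // the other with literal result elements, both producing identical output.
        Time("with-xsl-element.xslt", "input.xml");
        Time("with-literal-elements.xslt", "input.xml");
    }

    static void Time(string stylesheet, string input)
    {
        var xslt = new XslCompiledTransform();
        xslt.Load(stylesheet); // compile once, outside the timed loop

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000; i++)
        {
            xslt.Transform(input, null, TextWriter.Null); // discard output; we only want the timing
        }
        sw.Stop();
        Console.WriteLine($"{stylesheet}: {sw.ElapsedMilliseconds} ms for 10,000 transforms");
    }
}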
Whether writing out the element has better performance depends on the XSLT processor. The processor may optimize the first to have the same performance as the second. Having implemented an XSLT processor, I'd advise that writing the element out instead of using xsl:element is likely to be either as fast or faster, and unlikely to be slower.
You are not going to obtain any noticeable performance gain from such refactoring (can you feel a microsecond's difference?).
However, I strongly recommend using the most readable version:
<div class="someclass"/>
Even when the element has attributes whose values are dynamically calculated, always try to write:
<someElement attr="{someExpression}"/>
instead of:
<someElement>
<xsl:attribute name="attr">
<xsl:value-of select="someExpression"/>
</xsl:attribute>
</someElement>
