How can I optimize an XPath expression?

Is there any way I can shorten the following condition used in an XPath expression?
(../parent::td or ../parent::ol or ../parent::ul)
The version of XPath is 1.0.

The shortest is probably
../..[self::td|self::ol|self::ul]
Whether there is a performance difference between "|" and "or" will depend on the processor, but I suspect that in most cases it won't be noticeable. For performance, the important thing is to put the conditions in the right order (the one most likely to return true should come first). Factoring out the navigation to the grandparent should almost certainly help performance, with the caveats (a) your XPath engine may do this optimization automatically, and (b) the difference will probably be so tiny you will have trouble measuring it.

Use the '|' operator.
(../parent::td|../parent::ol|../parent::ul)

Slightly shorter:
../..[self::td or self::ol or self::ul]
Example usage:
//p[../..[self::td or self::ol or self::ul]]
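For instance, given invented markup like

<td><span><p>first</p></span></td>
<div><span><p>second</p></span></div>

the expression selects the first p (its grandparent is a td) but not the second (its grandparent is a div).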

Related

Replacement for descendant-or-self

I have an XPath $x/descendant-or-self::*/@y which I have changed to $x//@y, as it improved the performance.
Does this change have any other impact?
As explained in the W3C XPath Recommendation, // is shorthand for /descendant-or-self::node()/, so there is a slight difference. But since attributes can only occur on elements, I think this replacement is safe.
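Spelled out, the two expressions are

$x/descendant-or-self::*/@y        (y attributes of $x and of its element descendants)
$x//@y                             (expands to $x/descendant-or-self::node()/@y)

The only difference is walking node() rather than *, and since text, comment and processing-instruction nodes cannot carry attributes, both select exactly the same attribute nodes.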
That might also explain why you see a performance boost, since MarkLogic has to worry less about whether there really are elements in between.
HTH!

XPath performance & versions

I have 3 questions:
1) Is the XPath string "//table[position()=8 or position()=10]/td[1]/span[2]/text()" faster than the XPath string "//table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()"?
I use XPath with .NET C# and HtmlAgilityPack.
2) How can I determine which version of XPath I am using? If I use XPath 1.0, how do I upgrade to XPath 2.0?
3) Are there performance optimizations and improvements in XPath 2.0, or just new features and new syntax?
XPath 2.0 expands significantly on XPath 1.0 (read here for a summary), though you don't need to switch unless you would benefit from the new functionality.
As for which one would be faster, I believe the first, because the second repeats the node search. The first is also more readable, and in general you want to go with the more readable one anyway.
As to the performance question, I'm afraid I don't know. It depends on the optimizer in the particular XPath processor you are using. If it's important to you, measure it. If it's not important enough to measure, then it's not important enough to worry about.
As I mentioned in my previous reply, //table[8] smells wrong to me. I think it's much more likely that you want (//table)[8]. (Both are valid XPath expressions, but they produce different answers).
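To illustrate the difference:

(//table)[8]     the 8th table element in the whole document
//table[8]       every table that is the 8th table child of its own parent

If no parent anywhere has eight table children, the second form selects nothing at all.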
You can probably assume that a processor is XPath 1.0 unless it says otherwise - if it supports 2.0, they'll want you to know. But you can easily test, for example by seeing what happens when you do //a except //b.
There's no intrinsic reason why an XPath 2.0 processor should be faster than a 1.0 processor on the same queries. In fact, it might be a bit slower, because it's required to do more careful type-checking. On the other hand it might be a lot faster, because many 1.0 processors were dashed off very quickly and never upgraded. But there are massive improvements in functionality in 2.0, for example regular expression support.
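Since the question mentions C# and HtmlAgilityPack, here is a minimal sketch for measuring the two forms yourself (the file name page.html is invented; only timing against your real documents will settle the question):

using System;
using System.Diagnostics;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load("page.html"); // invented input file

var sw = Stopwatch.StartNew();

// One pass with a positional predicate:
var combined = doc.DocumentNode.SelectNodes(
    "//table[position()=8 or position()=10]/td[1]/span[2]/text()");
Console.WriteLine($"predicate: {sw.ElapsedMilliseconds} ms");

sw.Restart();

// Union of two absolute paths; a naive engine may scan the tree twice:
var unioned = doc.DocumentNode.SelectNodes(
    "//table[8]/td[1]/span[2]/text() | //table[10]/td[1]/span[2]/text()");
Console.WriteLine($"union: {sw.ElapsedMilliseconds} ms");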

Performance of XPath vs DOM

Would anyone enlighten me with a comprehensive performance comparison between XPath and DOM in different scenarios? I've read some questions on SO, like xPath vs DOM API, which one has a better performance and XPath or querySelector?. None of them mentions specific cases. Here are some things I could start with.
No iteration involved. getElementById('foobar') vs //*[@id='foobar']. Is the former consistently faster than the latter? What if the latter is optimized, e.g. /html/body/div[@id='foo']/div[@id='foobar']?
Iteration involved. getElementsByX then traversing the child nodes vs generating an XPath snapshot and traversing the snapshot items.
Axis involved. getElementsByX then traversing the next siblings vs //following-sibling::foobar.
Different implementations. Different browsers and libraries implement XPath and DOM differently. Which browser's implementation of XPath is better?
As the answer in xPath vs DOM API, which one has a better performance says, an average programmer may screw up when implementing complicated tasks (e.g. with multiple axes involved) the DOM way, while XPath is guaranteed to be optimized. Therefore, my question is only about the simple selections that can be done both ways.
Thanks for any comment.
XPath and DOM are both specifications, not implementations. You can't ask questions about the performance of a spec, only about specific implementations. There's at least a ten-to-one difference between a fast XPath engine and a slow one: and they may be optimized for different things, e.g. some spend a lot of time optimizing a query on the assumption it will be executed multiple times, which might be the wrong thing to do for single-shot execution. The one thing one can say is that the performance of XPath depends more on the engine you are using, and the performance of DOM depends more on the competence of the application programmer, because it's a lower-level interface. Of course all programmers consider themselves to be much better than average...
This page has a section where you can run tests to compare the two and see the results in different browsers. For instance, in Chrome, XPath is 100% slower than getElementById.
See getElementById vs QuerySelector for more information.
I agree with Michael that it may depend on the implementation, but I would generally say that DOM is faster. The reason is that I see no way to optimize the parsed document to make XPath faster.
If you're traversing HTML and not XML, a specialized parser is able to index all the ids and classes in the document. This makes getElementById and getElementsByClassName much faster.
With XPath, there's only one way to find an element by id: traversing, either top-down or bottom-up. You may be able to memoize repeated queries (or partial queries), but I don't see any other optimization that can be done.
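For a concrete feel of the two APIs being compared, here is a minimal browser-console sketch (the id foobar is just the question's example):

// DOM lookup, typically backed by an internal id index:
const byId = document.getElementById('foobar');

// Equivalent XPath lookup via document.evaluate:
const byXPath = document.evaluate(
    "//*[@id='foobar']", document, null,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null
).singleNodeValue;

console.log(byId === byXPath); // true when the id is unique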

Are there any cases where LINQ's .Where() will be faster than O(N)?

Think the title describes my thoughts pretty well :)
I've seen a lot of people lately who swear by LINQ, and while I also believe it's awesome, I also think you shouldn't be confused about the fact that on most (all?) IEnumerable types, its performance is not that great. Am I wrong in thinking this? Especially for queries where you nest Where()'s on large datasets?
Sorry if this question is a bit vague, I just want to confirm my thoughts in that you should be "cautious" when using LINQ.
[EDIT] Just to clarify - I'm thinking in terms of Linq to Objects here :)
It depends on the provider. For Linq to Objects, it's going to be O(n), but for Linq to SQL or Entities it might ultimately use indices to beat that. For Objects, if you need the functionality of Where, you're probably going to need O(n) anyway. Linq will almost certainly have a bigger constant, largely due to the function calls.
It depends on how you are using it and to what you compare.
I've seen many implementations using foreaches that would have been much faster with LINQ, e.g. because they forget to break or because they return too many items. The trick is that the lambda expressions are executed only when the item is actually used. When you have First at the end, it could end up being just one single call.
So when you chain Wheres, if an item does not pass the first condition, it will not be tested against the second condition either. It's similar to the && operator, which does not evaluate the condition on the right side if the one on the left is not met.
You could say it's always O(N). But N is not the number of items in the source, but the minimal number of items required to create the target set. That's a pretty good optimization IMHO.
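A minimal sketch of that short-circuiting behaviour (the numbers are invented):

using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var numbers = Enumerable.Range(1, 1000000);

        // Deferred execution: nothing is filtered until First() starts
        // pulling items, and enumeration stops at the first match.
        var first = numbers
            .Where(n => n % 2 == 0)   // evaluated lazily, item by item
            .Where(n => n > 10)       // only sees items that passed above
            .First();                 // stops after reaching 12

        Console.WriteLine(first);     // 12, after examining only 12 items
    }
}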
Here's a project that promises to introduce indexing for LINQ2Objects. This should deliver better asymptotic behavior: http://i4o.codeplex.com/

Prolog, fail and do not backtrack

Is there any built-in predicate in SWI-Prolog that will always fail AND prevent the machine from backtracking - that is, stop the program from executing immediately (this is not what fail/0 does)?
I could use cuts, but I don't like them.
Doing something like !, fail is not a problem for me, but in order to accomplish what I want I would have to use cuts in several more locations, and that is what I don't like.
You could use exceptions. Based on your question, they should help.
You could use the mechanism explicitly designed to help you accomplish something, but you don't like it?
You can always use not, which is essentially syntactic sugar for cut-fail.
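In most Prologs it behaves like the classic textbook definition (\+/1 is the ISO spelling):

not(Goal) :- call(Goal), !, fail.   % if Goal succeeds, cut the alternative and fail
not(_).                             % if Goal fails, succeed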
Two alternatives come to mind:
Pass a backtrack(true) or backtrack(false) term through the code you want to control, and interpret it in the definition of the predicates you're writing so that they fail quickly when it is set to backtrack(false) and continue when it is backtrack(true). Note that this won't actually prevent backtracking; it just enables fast failure. Even if your proof tree is deep, this should provide a fast way of preventing the execution of certain code on backtracking.
Use exceptions, as suggested by @Xonix (+1). Throwing an exception will terminate the construction of the proof tree immediately, and you can pass any term data through the exception up to the handler, bypassing any further execution - it will probably be faster than the first option, but may not be as portable. (A minimal sketch follows below.)
Personally I've used both methods before - the first where I've anticipated the need before writing the code, the latter where I haven't.
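Here is a minimal sketch of the exception-based approach; solve/1, search/1 and the exception term abort_search are invented for the example:

% throw/1 unwinds the proof tree immediately; catch/3 intercepts the
% term, discards all pending choice points inside search/1, and fails.
solve(X) :-
    catch(search(X), abort_search, fail).

search(X) :-
    member(X, [1, 2, 3]),
    ( X =:= 2 -> throw(abort_search) ; true ).

% ?- solve(X).
% X = 1 ;
% false.       % backtracking reached 2, which aborted the whole search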
Too bad, that's what cuts are for.
