Is there an XPath syntax to match, for instance, the occurrences numbered 2,3,5,7,11,13 of a certain kind of node? That is, the same result as the union of
//item[2]
//item[3]
//item[5]
...
but in a single expression.
(Use case: I am using a Genshi transformer to match and remove a set of nodes. I can't match and remove them in successive expressions, because their indices would change in between.)
You can try using XPath position(), for example:
//item[position()=2 or position()=3 or position()=5 ...]
or, if I understand correctly what you mean by "global position number", using parentheses so that positions are counted across the whole document rather than within each parent:
(//item)[position()=2 or position()=3 or position()=5 ...]
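The difference between the two forms can be sketched in Python. This is only an illustration under assumptions: the sample document and the helper name global_position_xpath are made up here, and the standard library's ElementTree (which supports only a small XPath subset, without "or") is used to emulate the selection rather than to run the expressions themselves.

```python
import xml.etree.ElementTree as ET

def global_position_xpath(path, positions):
    """Build '(path)[position()=a or position()=b ...]' from 1-based positions.

    Hypothetical helper: it only assembles the expression string.
    """
    predicate = " or ".join(f"position()={p}" for p in positions)
    return f"({path})[{predicate}]"

# Made-up sample document with items spread over two parents.
doc = ET.fromstring(
    "<root>"
    "<group><item>a</item><item>b</item></group>"
    "<group><item>c</item><item>d</item><item>e</item></group>"
    "</root>"
)

# Emulates (//item)[...]: positions are counted across the whole document.
wanted = {2, 3, 5}
all_items = list(doc.iter("item"))
global_matches = [el.text for i, el in enumerate(all_items, start=1) if i in wanted]

# Emulates //item[2]: the position is counted among siblings of each parent.
per_parent_matches = [el.text for el in doc.findall(".//item[2]")]

print(global_position_xpath("//item", [2, 3, 5]))
print(global_matches, per_parent_matches)
```

With this sample, the global form picks items from both groups by their overall order, while the per-parent form picks the second item of each group.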
I have an element with three occurrences on the page. If I match it with the XPath expression //div[@class='col-md-9 col-xs-12'], I get all three occurrences as expected.
Now I try to rework the matching element on the fly with
substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen'), to get the string before the word "Bewertungen",
normalize-space(//div[@class='col-md-9 col-xs-12']), to clean up redundant whitespace,
normalize-space(substring-before(//div[@class='col-md-9 col-xs-12'], 'Bewertungen')), to do both.
The problem with the last three expressions is that they extract only the first occurrence of the element. It makes no difference whether I add /text() after the matching expression.
I don't understand how adding normalize-space and/or substring-before influences the "main" expression so that it stops recognizing multiple occurrences of the targeted element and returns only the first. Without the addition it matches everything as it should.
How can I adjust XPath expression no. 3 to get all occurrences of the element?
Example url is https://www.provenexpert.com/de-de/jazzyshirt/
The problem is that both normalize-space() and substring-before() have a required cardinality of 1, meaning they can only accept a single occurrence of the element you are trying to normalize or take a substring of. Each of your expressions passes a sequence of 3 nodes, which these two functions cannot process. (I probably didn't express the problem properly, but I think this is the general idea.)
In light of that, try:
//div[@class='col-md-9 col-xs-12']/substring-before(normalize-space(.), 'Bewertung')
Note that in XPath 1.0, functions like substring-after(), if given a set of three nodes as input, ignore all nodes except the first. XPath 2.0 changes this: it gives you an error.
In XPath 3.1 you can apply a function to each of the nodes using the simple map operator, "!": //div[condition] ! substring-before(normalize-space(), 'Bewertung'). That returns a sequence of 3 strings. There's no equivalent in XPath 1.0, because there's no data type in XPath 1.0 that can represent a sequence of strings.
In XPath 2.0 you can often achieve the same effect using "/" instead of "!", but it has restrictions.
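When only an XPath 1.0 engine is available, the usual workaround is to select the nodes first and apply the string functions per node in the host language. A minimal Python sketch of that idea follows; the sample markup is a made-up stand-in for the real page, and normalize_space and substring_before are hypothetical helpers that approximate the XPath functions of the same name.

```python
import xml.etree.ElementTree as ET

def normalize_space(text):
    # Trim and collapse whitespace runs, like XPath normalize-space().
    # Note: str.split() treats more characters as whitespace than XPath does.
    return " ".join(text.split())

def substring_before(text, marker):
    # Like XPath substring-before(): empty string when the marker is absent.
    if not marker:
        return ""
    head, found, _ = text.partition(marker)
    return head if found else ""

# Hypothetical stand-in for the page; the real markup is not shown in the question.
page = ET.fromstring(
    "<body>"
    "<div class='col-md-9 col-xs-12'>  12   Bewertungen  gesamt </div>"
    "<div class='col-md-9 col-xs-12'>7 Bewertungen</div>"
    "<div class='other'>3 Bewertungen</div>"
    "</body>"
)

# Step 1: select every matching node. Step 2: apply the functions to each one.
results = [
    substring_before(normalize_space("".join(div.itertext())), "Bewertung")
    for div in page.findall(".//div[@class='col-md-9 col-xs-12']")
]
print(results)
```

The list comprehension plays the role that "!" plays in XPath 3.1: one string per selected node instead of one string for the first node only.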
When asking questions on StackOverflow, please always mention which version of XPath you are using. We tend to assume that if people don't say, they're probably using 1.0, because 1.0 products don't generally advertise their version number.
If I have two XPath queries where the second one is meant to further drill down the result of the first, can I safely let my script combine them into a single query by...
placing parentheses around the first query,
prefixing the second query with a slash, and then
simply concatenating the two strings ?
Context
The concrete usecase that sparked this question involves extracting information from XML/XHTML documents according to externally supplied pairs of "CSS selector + attribute name" using XPath behind the scenes.
For example the script may get the following as input:
selector: a#home, a.chapter
attribute: href
It then compiles the selector to an XPath query using the HTML::Selector::XPath Perl module, and the attribute by simply prefixing a @ ... which in this case would yield:
XPath query 1: //a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')]
XPath query 2: @href
And then it repeatedly passes those queries to libxml2's XPath engine to extract the requested information (in this example, a list of URLs) from the XML documents in question.
It works, but I would prefer to combine the two queries into a single one, which would simplify the code for invoking them and reduce the performance overhead:
XPath query: (//a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')])/@href
(note the added parentheses and slash)
But is this safe to do programmatically, for arbitrary input queries?
In general, no, you can't concatenate two arbitrary XPath expressions in this way, especially not in XPath 1.0. It's easy to find counter-examples: in XPath 1.0 you can't even have a union expression on the RHS of '/', so concatenating "/a" and "(b|c)" would fail.
In XPath 2.0, the result will always be syntactically valid, but it may contain type errors, e.g. if the expressions are "count(a)" and "b". The LHS operand of "/" must evaluate to a sequence of nodes.
Sure, this should work. However, you will always have to respect the correct context. If the elements in your example in the first query have no href attribute, you will get an empty result set.
Also, you will have to take care of e.g. a leading slash in front of your second query, so that you don't end up with a descendant-or-self axis step, which might not be what you want. Apart from that, this should always work. The worst that can happen is that it is not logically correct (i.e. you don't get the expected result), but it should always be valid XPath.
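The concatenation from the question, together with the leading-slash caveat, might look like this in Python. It is a sketch only: combine_xpath is a hypothetical name, and the function does plain string handling. It does not check the other failure cases raised in the answers (a first query that does not return nodes, or XPath 1.0 grammar restrictions on what may follow "/").

```python
def combine_xpath(first, second):
    """Naively join two XPath strings as '(first)/second'.

    Hypothetical helper: no parsing or validation of either expression.
    """
    if second.startswith("/"):
        # "(first)/" + "/x" would silently become a descendant-or-self step.
        raise ValueError("second query must be relative, got: " + second)
    return "(" + first + ")/" + second

first = "//a[@id='home'] | //a[contains(concat(' ', @class, ' '), ' chapter ')]"
combined = combine_xpath(first, "@href")
print(combined)
```

Rejecting an absolute second query is a design choice made for this sketch; stripping the slash instead would change the meaning of the caller's input without telling them.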
Suppose you have a string, say AB--AB. I want to use XPath to look for nodes whose attribute can be AB??AB, meaning that the question marks are a kind of placeholder. Their number can vary, so AB?-AB, for example, should also match.
How can you solve this?
XPath 2.0 has regular expression support: matches($string, '^AB.{0,2}AB$') would match if the string is the literal AB, followed by zero, one or two arbitrary characters, followed by the literal AB. The ^ and $ anchors are needed because matches() otherwise succeeds when any substring matches the pattern.
With XPath 1.0, you have to stick with substring(....), substring-before(....), substring-after(....), starts-with(...) and string-length(...). Sadly, there isn't even an ends-with(...) function.
A possible solution that allows all strings starting and ending with "AB" with at least two characters in between might be (I'm not totally sure of your needs):
//foo[
    starts-with(., 'AB')
    and
    substring(., string-length(.)-1, 2) = 'AB'
    and
    string-length(.) >= 6
]
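The two approaches do not accept exactly the same strings, which is easy to see by emulating them in Python. This is a rough sketch: both function names are made up, re.fullmatch stands in for an anchored matches() call, and the second function mirrors the three conditions of the XPath 1.0 predicate above.

```python
import re

def matches_regex(value):
    # Whole-string match of AB, then 0 to 2 arbitrary characters, then AB.
    return re.fullmatch(r"AB.{0,2}AB", value) is not None

def matches_xpath1_style(value):
    # Mirrors: starts-with(., 'AB')
    #          and substring(., string-length(.)-1, 2) = 'AB'
    #          and string-length(.) >= 6
    return value.startswith("AB") and value[-2:] == "AB" and len(value) >= 6

for sample in ["AB--AB", "AB-AB", "ABAB", "AB---AB"]:
    print(sample, matches_regex(sample), matches_xpath1_style(sample))
```

The regular expression caps the filler at two characters, while the XPath 1.0 predicate requires at least two and sets no upper bound, so the length condition would need adjusting to match the exact requirement.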
<bits>
<thing>Match this please</thing>
<thing>Don't match this</thing>
<thing>Match <b>this</b> please</thing>
</bits>
An expression like this:
//thing[text()='Match this please']
will locate the first 'thing' but not the third, because the phrase is distributed through a child node.
Is there an expression that would match the first and the third 'thing' in my example?
Try:
//thing[string()='Match this please']
jsfiddle:
http://jsfiddle.net/ZG9n3/2/
Please check the reference to see if this is going to work for you:
http://www.w3.org/TR/xpath/#function-string
Is there an expression that would
match the first and the third 'thing'
in my example?
You mean: Is there an expression that would select the first and the third element named thing, based on their string value.
Use:
/*/thing[. = 'Match this please']
The predicate compares the string value of the context node to the string "Match this please".
By definition, the string value of an element is the concatenation (in document order) of all of its text-node descendants.
Note: Always try to avoid the // abbreviation because its use may be very inefficient. Whenever the structure of an XML document is known, use a chain of specific location steps.
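The difference between comparing a text node and comparing the string value can be reproduced with Python's standard library. This is a sketch rather than an XPath evaluation: the .text attribute stands in for the element's first text node (enough for this sample), and joining itertext() stands in for the string value.

```python
import xml.etree.ElementTree as ET

bits = ET.fromstring(
    "<bits>"
    "<thing>Match this please</thing>"
    "<thing>Don't match this</thing>"
    "<thing>Match <b>this</b> please</thing>"
    "</bits>"
)
target = "Match this please"
things = bits.findall("thing")

# Like text()='...': only the element's own leading text node is compared.
by_first_text_node = [i for i, t in enumerate(things, 1) if t.text == target]

# Like .='...': all descendant text nodes are concatenated in document order.
by_string_value = [
    i for i, t in enumerate(things, 1) if "".join(t.itertext()) == target
]

print(by_first_text_node, by_string_value)
```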
I have XML documents like:
<rootelement>
<myelement>test1</myelement>
<myelement>test2</myelement>
<myelement type='specific'>test3</myelement>
</rootelement>
I'd like to retrieve the specific myelement, and if it's not present, then the first one. So I write:
/rootelement/myelement[@type='specific' or position()=1]
The XPath spec states about the 'or expression' that:
The right operand is not evaluated if
the left operand evaluates to true
The problem is that libxml2-2.6.26 seems to apply the union of both expressions, returning a "2 Node Set" (for example using xmllint --shell).
Is this a libxml2 problem, or am I doing something wrong?
Short answer: your selector doesn't express what you think it does.
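The or operator is a boolean operator evaluated separately for each candidate node, so the overall result is the union of the nodes that satisfy either condition.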
The part of the spec you quoted ("The right operand is not evaluated...") is part of standard boolean logic short circuiting.
Here's why you get a 2-node set for your example input: XPath looks at every myelement that's a child of rootelement, and applies the [@type='specific' or position()=1] part to each such node to determine whether or not it matches the selector.
<myelement>test1</myelement> does not match @type='specific', but it does match position()=1, so it matches the whole selector.
<myelement>test2</myelement> does not match @type='specific', and it also does not match position()=1, so it does not match the whole selector.
<myelement type='specific'>test3</myelement> matches @type='specific' (so XPath does not have to test its position - that's the short-circuiting part) so it matches the whole selector.
The first and last <myelement>s match the whole selector, so it returns a 2-node set.
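The per-node evaluation described above can be traced in a few lines of Python. This is an emulation under assumptions, not a call into libxml2: the loop plays the role of the predicate, and Python's own "or" short-circuits in the same way.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<rootelement>"
    "<myelement>test1</myelement>"
    "<myelement>test2</myelement>"
    "<myelement type='specific'>test3</myelement>"
    "</rootelement>"
)

selected = []
for position, el in enumerate(root.findall("myelement"), start=1):
    # The predicate is tested once per candidate node, not once per query.
    if el.get("type") == "specific" or position == 1:
        selected.append(el.text)

print(selected)
```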
The easiest way to select elements the way you want to is to do it in two steps. Here's the pseudocode (I don't know what context you're actually using XPath in, and I'm not that familiar with writing XPath-syntax selectors):
Select elements that match /rootelement/myelement[@type='specific']
If elements is empty, select elements that match /rootelement/myelement[position()=1]
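A minimal Python version of this two-step pseudocode, assuming the standard library's ElementTree and a hypothetical function name pick_myelement; the second sample document (without a specific element) is made up to exercise the fallback.

```python
import xml.etree.ElementTree as ET

def pick_myelement(root):
    """Return the 'specific' myelement(s), or the first myelement as a fallback."""
    specific = root.findall("myelement[@type='specific']")
    if specific:
        return specific
    # Step 2 only runs when step 1 found nothing.
    return root.findall("myelement[1]")

with_specific = ET.fromstring(
    "<rootelement>"
    "<myelement>test1</myelement>"
    "<myelement>test2</myelement>"
    "<myelement type='specific'>test3</myelement>"
    "</rootelement>"
)
without_specific = ET.fromstring(
    "<rootelement><myelement>test1</myelement><myelement>test2</myelement></rootelement>"
)

print([el.text for el in pick_myelement(with_specific)])
print([el.text for el in pick_myelement(without_specific)])
```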
@Matt Ball explained the cause of your problem very well.
Here is an XPath one-liner selecting exactly what you want:
/*/myelement[@type='specific'] | /*[not(myelement[@type='specific'])]/myelement[1]