Limit XPath matches (or choose match index) at location step - xpath

I want to limit the number of matches, or better yet, choose which indexed match, is made at a location step in XPath.
I know that if I have the following document:
<blah>
  <foo>
    <bar>
      <target>one</target>
      <target>two</target>
    </bar>
    <bar>
      <target>three</target>
      <target>four</target>
    </bar>
  </foo>
</blah>
I can limit the expression to only include the first <bar> directly under <foo> like the following, in order to only return <target>one</target> and <target>two</target>:
//foo/bar[1]/target
But now what if I have this:
<blah>
  <foo>
    <blah>
      <bar>
        <target>one</target>
        <target>two</target>
      </bar>
    </blah>
    <blah>
      <bar>
        <target>three</target>
        <target>four</target>
      </bar>
    </blah>
  </foo>
</blah>
I still only want the expression to yield <target>one</target> and <target>two</target>. But the following doesn't work; it actually matches the first <bar> under each <foo><blah>:
//foo//bar[1]/target
Apparently there is a subtle distinction: bar[1] means "the first <bar> child of its parent element", not "the first element that matches this location step in the path expression".
How can I limit the expression (without naming the <blah> elements, because they can vary) so that when it reaches the //foo//bar location it only retains the first match of <bar> before it continues matching the expression?

Ah, with just a few more seconds of thought I found the answer! (Actually forming the question helped me discover the solution.)
The secret is simply to use parentheses:
(//foo//bar)[1]/target
If I'm understanding it correctly, the expression //foo//bar[1] says, "for every … <bar> you match, only take the first one of them for the parent", so if there are multiple "first-bars", it will take each of the "first-bars".
Instead, the expression (//foo//bar)[1] says, "once you match … <bar>, only take the first matching one; then evaluate the rest of the path from that one matching <bar>".
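The difference can be modeled in plain Ruby, without any XPath library. This is only a sketch: the nested arrays are a hypothetical stand-in for the second document (one inner array per <blah> parent, one hash per <bar> child), and the two Ruby expressions mirror how I understand the two XPath expressions to be evaluated.

```ruby
# Hypothetical stand-in for the second document: each inner array is one
# <blah> parent, each hash is a <bar> child holding its <target> texts.
parents = [
  [{ targets: %w[one two] }],
  [{ targets: %w[three four] }]
]

# //foo//bar[1]: the predicate is applied per parent, so every parent
# contributes its own first <bar>.
per_parent = parents.map(&:first).flat_map { |bar| bar[:targets] }

# (//foo//bar)[1]: all <bar> matches are collected first, then the
# predicate picks the first one from the combined list.
overall = parents.flatten(1).first[:targets]

p per_parent # ["one", "two", "three", "four"]
p overall    # ["one", "two"]
```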
Whew! I wish every road bump could be this simple to get past.

Related

Why won't a longer token in an alternation be matched?

I am using ruby 2.1, but the same thing can be replicated on rubular site.
If this is my string:
儘管中國婦幼衛生監測辦公室制定的
And I do a regex match with this expression:
(中國婦幼衛生監測辦公室制定|管中)
I am expecting to get the longer token as a match.
中國婦幼衛生監測辦公室制定
Instead I get the second alternation as a match.
As far as I know, it does work like that when the text is not in Chinese characters.
If this is my string:
foobar
And I use this regex:
(foobar|foo)
The returned match is foobar. If the alternatives are in the other order, then the match is foo. That makes sense to me.
Your assumption that regex matches a longer alternation is incorrect.
If you have a bit of time, let's look at how your regex works...
Quick refresher: How regex works: The state machine always reads from left to right, backtracking where necessary.
There are two pointers, one on the Pattern:
(cdefghijkl|bcd)
The other on your String:
abcdefghijklmnopqrstuvw
The pointer on the String moves from the left. As soon as it can return, it will:
[image; source: gyazo.com]
Let's turn that into a more "sequential" sequence for understanding:
[image; source: gyazo.com]
Your foobar example is a different topic. As I mentioned in this post:
How regex works: The state machine always reads from left to right. The pattern ,|,, is equivalent to , because at any given position only the first alternation will ever be matched.
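Both behaviours can be checked in a few lines of Ruby. This is a minimal sketch using the strings from the question; the variable names are mine.

```ruby
# The engine tries start positions from left to right and reports the first
# position where any alternative matches, regardless of alternative length.
text = "儘管中國婦幼衛生監測辦公室制定的"
m = text.match(/(中國婦幼衛生監測辦公室制定|管中)/)
puts m[0]       # 管中
puts m.begin(0) # 1 (the longer alternative could only start at index 2)

# At the same start position, alternatives are tried in the order written.
long_first  = "foobar"[/(foobar|foo)/]
short_first = "foobar"[/(foo|foobar)/]
puts long_first  # foobar
puts short_first # foo
```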
    That's good, Unihedron, but how do I force it to the first alternation?
Look!*
^(?:.*?\Kcdefghijkl|.*?\Kbcd)
Here have a regex demo.
This regex first attempts to find the first alternation anywhere in the string. Only if that fails completely will it then attempt the second alternation. \K discards everything matched before it from the reported match, so only the text after \K is returned.
*: \K has been supported in Ruby since 2.0.0.
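Here is a small Ruby sketch of the \K pattern next to the plain alternation. The helper first_by_priority is hypothetical (it is not part of the answer above); it shows an alternative that avoids \K by simply trying each pattern in priority order.

```ruby
text = "abcdefghijklmnopqrstuvw"

# Plain alternation: the earliest start position wins.
plain = text[/(cdefghijkl|bcd)/]

# \K version: the whole string is searched for the first alternative
# before the second one is considered.
prioritized = text[/^(?:.*?\Kcdefghijkl|.*?\Kbcd)/]

# Hypothetical helper: try each pattern in order, return the first hit.
def first_by_priority(text, patterns)
  patterns.each do |re|
    m = text.match(re)
    return m[0] if m
  end
  nil
end

puts plain                                            # bcd
puts prioritized                                      # cdefghijkl
puts first_by_priority(text, [/cdefghijkl/, /bcd/])   # cdefghijkl

# The same idea applied to the string from the question.
chinese = "儘管中國婦幼衛生監測辦公室制定的"
puts chinese[/^(?:.*?\K中國婦幼衛生監測辦公室制定|.*?\K管中)/]
```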
Read more:
The Stack Overflow Regex Reference
On greedy vs non-greedy
Ah, I was bored, so I optimized the regex:
^(?:(?:(?!cdefghijkl)c?[^c]*)++\Kcdefghijkl|(?:(?!bcd)b?[^b]*)++\Kbcd)
You can see a demo here.

Regex negative lookbehinds with a wildcard

I'm trying to match some text if it does not have another block of text in its vicinity. For example, I would like to match "bar" if "foo" does not precede it. I can match "bar" if "foo" does not immediately precede it using negative look behind in this regex:
/(?<!foo)bar/
but I also like to not match "foo 12345 bar". I tried:
/(?<!foo.{1,10})bar/
but using a wildcard + a range appears to be an invalid regex in Ruby. Am I thinking about the problem wrong?
You are thinking about it the right way. Unfortunately, lookbehinds usually have to be of fixed length. The only major exception is .NET's regex engine, which allows repetition quantifiers inside lookbehinds. But since you only need a negative lookbehind and not a lookahead as well, there is a hack for you: reverse the string, then try to match:
/rab(?!.{0,10}oof)/
Then reverse the result of the match or subtract the matching position from the string's length, if that's what you are after.
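A minimal Ruby sketch of the reversal trick follows. The method name bar_without_foo is hypothetical. Note that the first match in the reversed string corresponds to the last qualifying "bar" in the original.

```ruby
# Hypothetical helper: returns the start index (in the original string) of
# the last "bar" that has no "foo" within the 10 characters before it,
# or nil if there is no such "bar".
def bar_without_foo(str)
  reversed = str.reverse
  m = reversed.match(/rab(?!.{0,10}oof)/)
  return nil unless m
  # The end of the reversed match maps back to the start in the original.
  str.length - m.end(0)
end

p bar_without_foo("foo 12345 bar") # nil
p bar_without_foo("xyz bar")       # 4
p bar_without_foo("bar foo bar")   # 0
```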
Now from the regex you have given, I suppose that this was only a simplified version of what you actually need. Of course, if bar is a complex pattern itself, some more thought needs to go into how to reverse it correctly.
Note that if your pattern required both variable-length lookbehinds and lookaheads, you would have a harder time solving this. Also, in your case, it would be possible to deconstruct your lookbehind into multiple fixed-length ones (because you use neither + nor *):
/(?<!foo)(?<!foo.)(?<!foo.{2})(?<!foo.{3})(?<!foo.{4})(?<!foo.{5})(?<!foo.{6})(?<!foo.{7})(?<!foo.{8})(?<!foo.{9})(?<!foo.{10})bar/
But that's not all that nice, is it?
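If the long pattern is only unpleasant to type, it can be generated. This is a sketch assuming the same "foo", "bar" and 0 to 10 character gap as above; the variable names are mine.

```ruby
# Build one fixed-length negative lookbehind per allowed gap size (0..10),
# then append the pattern that should be matched.
lookbehinds = (0..10).map { |n| "(?<!foo#{'.' * n})" }.join
re = Regexp.new(lookbehinds + "bar")

p("foo 12345 bar" =~ re)              # nil
p("foobar" =~ re)                     # nil
p("xyz bar" =~ re)                    # 4
p(("foo" + " " * 11 + "bar") =~ re)   # 14 (the gap is longer than 10)
```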
As m.buettner already mentions, lookbehind in Ruby regex has to be of fixed length, and is described as such in the documentation. So, you cannot put a quantifier within a lookbehind.
You don't need to check all in one step. Try doing multiple steps of regex matches to get what you want. Assuming that existence of foo in front of a single instance of bar breaks the condition regardless of whether there is another bar, then
string.match(/bar/) and !string.match(/foo.*bar/)
will give you what you want for the example.
If you would rather have the match succeed for bar foo bar, then you can do this:
string.scan(/foo|bar/).first == "bar"
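The two checks behave differently on bar foo bar, which a short Ruby sketch makes visible. The method names are hypothetical wrappers around the two expressions above.

```ruby
# Hypothetical wrapper for the first check: there is a "bar", and no "foo"
# appears anywhere before any "bar".
def no_foo_before_any_bar?(s)
  !!(s.match(/bar/) && !s.match(/foo.*bar/))
end

# Hypothetical wrapper for the second check: the first of "foo"/"bar" to
# appear in the string is "bar".
def bar_comes_first?(s)
  s.scan(/foo|bar/).first == "bar"
end

["just bar", "foo 12345 bar", "bar foo bar"].each do |s|
  puts "#{s.inspect}: #{no_foo_before_any_bar?(s)} / #{bar_comes_first?(s)}"
end
```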

How to match text sequences that continue through child nodes (e.g. with sgml-style markup)?

<bits>
  <thing>Match this please</thing>
  <thing>Don't match this</thing>
  <thing>Match <b>this</b> please</thing>
</bits>
An expression like this:
//thing[text()='Match this please']
will locate the first 'thing' but not the third, because the phrase is distributed through a child node.
Is there an expression that would match the first and the third 'thing' in my example?
Try:
//thing[string()='Match this please']
jsfiddle:
http://jsfiddle.net/ZG9n3/2/
Please check the reference to see if this is going to work for you:
http://www.w3.org/TR/xpath/#function-string
    Is there an expression that would match the first and the third 'thing' in my example?
You mean: is there an expression that would select the first and the third element named thing, based on their string value?
Use:
/*/thing[. = 'Match this please']
The predicate compares the string value of the context node to the string "Match this please".
By definition, the string value of an element is the concatenation (in document order) of all of its text-node descendants.
Note: Always try to avoid the // abbreviation because its use may incur big inefficiency. Whenever the structure of an XML document is known, use a chain of specific location steps.

XPath 'or' behaving like union ('|') with libxml2

I have XML documents like:
<rootelement>
  <myelement>test1</myelement>
  <myelement>test2</myelement>
  <myelement type='specific'>test3</myelement>
</rootelement>
I'd like to retrieve the specific myelement, and if it's not present, then the first one. So I write:
/rootelement/myelement[@type='specific' or position()=1]
The XPath spec states about the 'or expression' that:
    The right operand is not evaluated if the left operand evaluates to true
The problem is that libxml2-2.6.26 seems to apply the union of both expressions, returning a "2 Node Set" (for example using xmllint --shell).
Is this a libxml2 problem, or am I doing something wrong?
Short answer: your selector doesn't express what you think it does.
Within a predicate, the or operator behaves like a union: every node that satisfies either operand is selected.
The part of the spec you quoted ("The right operand is not evaluated...") is part of standard boolean logic short circuiting.
Here's why you get a 2-node set for your example input: XPath looks at every myelement that's a child of rootelement, and applies the [@type='specific' or position()=1] part to each such node to determine whether or not it matches the selector.
<myelement>test1</myelement> does not match @type='specific', but it does match position()=1, so it matches the whole selector.
<myelement>test2</myelement> does not match @type='specific', and it also does not match position()=1, so it does not match the whole selector.
<myelement type='specific'>test3</myelement> matches @type='specific' (so XPath does not have to test its position; that's the short-circuiting part), so it matches the whole selector.
The first and last <myelement>s match the whole selector, so it returns a 2-node set.
The easiest way to select elements the way you want to is to do it in two steps. Here's the pseudocode (I don't know what context you're actually using XPath in, and I'm not that familiar with writing XPath-syntax selectors):
Select elements that match /rootelement/myelement[@type='specific']
If elements is empty, select elements that match /rootelement/myelement[position()=1]
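The per-node evaluation and the two-step fallback can be modeled in plain Ruby. This is only a sketch of the logic, not XPath itself: the array of hashes is a hypothetical stand-in for the three <myelement> nodes, and no XML library is involved.

```ruby
# Hypothetical stand-in for the three <myelement> children, in document order.
elements = [
  { type: nil,        text: "test1" },
  { type: nil,        text: "test2" },
  { type: "specific", text: "test3" }
]

# [@type='specific' or position()=1]: the predicate is tested on every node
# separately, so any node satisfying either operand is kept.
or_result = elements.each_with_index
                    .select { |e, i| e[:type] == "specific" || i + 1 == 1 }
                    .map { |e, _| e[:text] }

# Two-step version: prefer the specific elements, fall back to the first.
specific = elements.select { |e| e[:type] == "specific" }
chosen = (specific.empty? ? elements.first(1) : specific).map { |e| e[:text] }

p or_result # ["test1", "test3"]
p chosen    # ["test3"]
```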
@Matt Ball explained the cause of your problem very well.
Here is an XPath one-liner selecting exactly what you want:
/*/myelement[@type='specific'] | /*[not(myelement[@type='specific'])]/myelement[1]

How do I construct an XPath expression to select items that do not contain a string

How do I use something similar to the example below, but with the opposite result, i.e., items that do not contain (default) in the text?
<test>
  <item>Some text (default)</item>
  <item>Some more text</item>
  <item>Even more text</item>
</test>
Given this
//test/item[contains(text(), '(default)')]
would return the first item. Is there a not operator that I can use with contains?
Yes, there is:
//test/item[not(contains(text(), '(default)'))]
Hint: not() is a function in XPath instead of an operator.
An alternative, possibly better way to express this is:
//test/item[not(text()[contains(., '(default)')])]
There is a subtle but important difference between the two expressions (let's call them A and B, respectively).
Simple case: If all <item> only have a single text node child, both A and B behave the same.
Complex case: If <item> can have multiple text node children, expression A only matches when '(default)' occurs in the first of them.
This is because text() matches all text node children and produces a node-set. So far no surprise. Now, contains() accepts a node-set as its first argument, but it needs to convert it to a string to do its job. And conversion from node-set to string only produces the string value of the first node in the set; all other nodes are disregarded (try string(//item) to see what I mean). In the simple case this is exactly what happens as well, but the result is not as surprising.
Expression B deals with this by explicitly checking every text node individually instead of only checking the string value of the first one. It's therefore the more robust of the two.
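The complex case can be modeled in plain Ruby. This is a sketch of the conversion rule only, not XPath itself: the array is a hypothetical stand-in for the text node children of an item such as <item>Some text <b>x</b> (default)</item>.

```ruby
# Hypothetical text node children of one <item>, in document order.
text_nodes = ["Some text ", " (default)"]

# Expression A: contains(text(), '(default)') converts the node-set to a
# string, which only uses the first node.
a_contains = text_nodes.first.to_s.include?("(default)")

# Expression B: text()[contains(., '(default)')] tests every text node.
b_contains = text_nodes.any? { |t| t.include?("(default)") }

# Both expressions wrap the test in not(...).
puts "A selects the item: #{!a_contains}" # true, although (default) is present
puts "B selects the item: #{!b_contains}" # false
```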
