Avoid parentheses in path using XPath 1.0 - xpath

The following XML structure represents a website with many articles. Every article contains, among many other things, the date of its creation and possibly arbitrarily many dates of its modification. I want to get the date of the last access (either creation or last modification) to every article using XPath 1.0.
<website>
<article>
<date><strong>22.11.2017</strong></date>
<edits>
<edit><strong>17.12.2017</strong></edit>
</edits>
</article>
<article>
<date><strong>17.4.2016</strong></date>
<edits></edits>
</article>
<article>
<date><strong>3.5.2011</strong></date>
<edits>
<edit><strong>4.5.2011</strong></edit>
<edit><strong>12.8.2012</strong></edit>
</edits>
</article>
<article>
<date><strong>12.2.2009</strong></date>
<edits></edits>
</article>
<article>
<date><strong>23.11.1987</strong></date>
<edits>
<edit><strong>3.4.2001</strong></edit>
<edit><strong>11.5.2006</strong></edit>
<edit><strong>13.9.2012</strong></edit>
</edits>
</article>
</website>
In other words, the expected output is:
<strong>17.12.2017</strong>
<strong>17.4.2016</strong>
<strong>12.8.2012</strong>
<strong>12.2.2009</strong>
<strong>13.9.2012</strong>
So far I've only created this path:
//article/*[self::date or self::edits/edit][last()]
that looks for date and nonempty edits nodes in every article and selects the latter one. But I don't know how to access the latest strong of every such selection and the naive //strong[last()] appended to the end of the path doesn't work.
I found a solution in XPath 2.0. Either of these paths should work, if I'm not mistaken:
//article/(*[self::date or self::edits/edit][last()]//strong)[last()]
//article/(*//strong)[last()]
Such use of parentheses within a path is invalid in XPath 1.0, though.

This XPath 1.0 expression
/website/article/descendant::strong[parent::date|parent::edit][last()]
selects the nodes:
<strong>17.12.2017</strong>
<strong>17.4.2016</strong>
<strong>12.8.2012</strong>
<strong>12.2.2009</strong>
<strong>13.9.2012</strong>
Tested in http://www.xpathtester.com/xpath/56d8f7bc4b9c8c064fdad16f22469026
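Do note: a position predicate such as [last()] acts over the context list of its own step, which here is the list of strong descendants of each article, not every strong in the document.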

Here is the simple xpath to get your output.
//article/descendant-or-self::strong[last()]

Related

XPath (1.0) Match consecutive elements until specific child or end

This is for XPath 1.0.
Here is an example of the mark up that I am matching against. The actual number of elements is not known ahead of time and thus varies, but following this sort of of pattern:
<div class="entry">
<p><iframe /></p>
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
<p><iframe /></p>
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
</div>
I am trying to to match every <p> that does not contain an <iframe>, up until the next <p> that does contain an <iframe> or until the end of the enclosing <div> element.
To make things slightly more complicated, for specific reasons I need to use each <iframe> as the base, a la //div[#class='entry']//iframe, so that each nodeset is based from
(//div[#class='entry']//iframe)[1]
(//div[#class='entry']//iframe)[2]
...
and thus, in this case, matching
<p>Text 1</p>
<p>Text 2</p>
<p>Test 3</p>
and
<p>
<a>Test 4</a>
<br />
<a>Test 5</a>
</p>
respectively.
I tried some of the following for testing to no avail:
(//div[#class='entry']//iframe)/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(or for testing):
(//div[#class='entry']//iframe)[1]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
(//div[#class='entry']//iframe)[2]/ancestor::p/following-sibling::p[preceding-sibling::p[iframe]]
and some variations thereof but what happens for the first set is it gets all <iframe>-less <p> elements all the way to the end instead of stopping at the next <p> that contains a <iframe>.
I've been at this for a while and even though I'm usually quite handy with this sort of thing, I can't quite work my way thorigh this one and none of the search results from Google and such have helped.
Thanks. Any help is always appreciated.
Edit: It can be assumed that there is only one occurrence of <div class="entry"> in the document.
What you are asking for can't be done in one single XPath 1.0 expression without help. The problem is that the question you want to ask is
Starting from an element X (the p-containing-an-iframe), find the other p elements for which that element's nearest preceding p-with-an-iframe is the original node X
If we had a variable $x holding a reference to the top-level context node (the p[iframe] we're starting from) then you could say something like the following (in XPath 2.0)
following-sibling::p[not(iframe)][preceding-sibling::p[iframe][1] is $x]
XPath 1.0 doesn't have an is operator to compare node identity but there are other proxies you can use for this, for example
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe])
= (count($x/preceding-sibling::p[iframe]) + 1)]
i.e. those following p elements that have one more preceding-sibling::p[iframe] than $x has.
The nub of the problem then is how to get at the outer context node from inside the inner predicate - pure XPath 1.0 has no way to do this. In XSLT you have the current() function, but otherwise you have two basic choices:
If your XPath library allows you to provide variable bindings to your expressions, then inject a variable $x containing the context node and use the expression I've given above.
If you can't inject variables then use two separate XPath queries in sequence.
First execute the expression
count(preceding-sibling::p[iframe]) + 1
with the relevant p[iframe] as context node, and take the result as a number. Or alternatively, if you're already iterating over these p[iframe] elements in your host language then just take the iteration number from there directly, you don't need to count it up using XPath. Either way, you can then build a second expression dynamically:
following-sibling::p[not(iframe)][count(preceding-sibling::p[iframe]) = N]
(where N is the result of the first expression/iteration counter) and evaluate that with the same context node, taking the final result as a node set.
I'm not sure I understood completely, but sometimes it helps to comment on an attempted solution rather than trying to explain.
Please try the following XPath expression:
//div[#class='entry']//iframe//p[not(descendant::iframe)]
And let me know if this yields the correct result.
If not,
explain how the result differs from what you need
please show a more complete HTML sample: a reasonable document with multiple div elements, and more than one where div[#class = 'entry'] - and otherwise covering all the complexity you describe.
explain why you added [1] and [2] to your expressions
give more details about the platform you're using XPath with, perhaps post code

Xpath expression to find element that do NOT have a matching ancestor

I'm trying to use xpath to extract HTML5 microdata from a page. I'm essentially trying to say "find nested nodes with an itemprop=name attribute that are not nested inside another itemscope element (at any depth)". Given the following example I'm trying to find the name of the product (shoes) but I don't want the brand name (Nike).
<div itemscope itemtype="http://schema.org/Product>
<div itemscope itemtype="http://schema.org/Brand">
<div itemprop="name">Nike</div> <!-- don't want this -->
</div>
<div itemprop="name">shoes</div> <!-- do want this -->
</div>
I can easily find the itemprop=name element by using something like //*[#itemprop=name] but this would also pull in the brand name. Btw the elements shown in the example may be nested inside other tags so I can't simple say "whose immediate parent does not have an itemscope attribute" I believe there may be something relating to ancestors that I can use but I don't know enough about xpath. Any ideas?
A single expression to find all the itemprop="name" elements with at most one itemscope ancestor would be
//*[#itemprop = 'name'][not(ancestor::*[#itemscope][2])]
If you wanted to start from one specific itemscope node and find the names that are nested specifically in it (and not a nested scope) then that's not something that you can do in one XPath 1.0 expression. You'd have to first extract its descendant names
.//*[#itemprop='name']
and then for each of those, find its nearest itemscope ancestor
ancestor::*[#itemscope][1]
and check (on the python side) whether or not that node is the same node as the one you started from. In XPath 2.0 you could do this in one with
for $me in . return (.//*[#itemprop='name'][ancestor::*[#itemscope][1] is $me])
but 1.0 doesn't have the for $x in Y return Z structure for binding variables, or the is operator to compare node identity.
Please give this a try:
//*[#itemprop = 'name' and not(ancestor::*[#itemscope][2])]

Xpath having multiple predicate statements

I've looked around and can't seem to find the answer for this.
Very simplified:
<a>
<b>
<div class=label>
<label="here"/>
</div>
</b>
<div id="something">
<b>
<div class=label>
<label="here"/>
</div>
</div>
</a>
so I'm trying to grab the second "here" label. What I want to do is do the id to get to the "something" part
//.[#id="something”]
and then from that point search for the label with something like
//.[#class="label" and label="here"]
But from reading a few other replies it doesn't appear that something like
//.[#id="something”][#class="label" and label="here"]
works and I was just wondering if I'm just missing something in how it's working? I know I can get the above really simply with another method, it's just an example to ask how to do two predicate statements after each other (if it is indeed possible).
Thanks!
I think you need something like this instead :
//.[#id="something”]//.[#class="label" and label="here"]
The point is that the // means : Selects nodes in the document from the current node that match the selection no matter where they are
ref : http://www.w3schools.com/xpath/xpath_syntax.asp
The syntax //*[#x='y'] is more idiomatic than //.[#x='y'], probably because it's valid in both XPath 1.0 and XPath 2.0, whereas the latter is only allowed in XPath 2.0. Disallowing predicates after "." was probably an accidental restriction in XPath 1.0, and I think some implementations may have relaxed the restriction, but it's there in the spec: a predicate can only follow either a NodeTest or a PrimaryExpr, and "." is neither.
In XPath 2.0, //* selects all element nodes in the tree, while //. selects all nodes of all kinds (including the document root node), but in this example the effect is the same because the predicate [#x='y'] can only be matched by an element node (for all other node kinds, #x selects nothing and therefore cannot be equal to anything).

How does empty start tag work in HTML4?

The HTML4 specification mentions various SGML shorthand markup constructs. While I understand what others do, with a help of HTML validator, I cannot find understand why anyone would want an empty start tag. It cannot even have attributes, so it's not a shorter <span>.
The SGML definition of HTML4 enables the empty start feature. In it, there is an interesting section with features.
FEATURES
MINIMIZE
DATATAG NO
OMITTAG YES
RANK NO
SHORTTAG YES
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL YES
APPINFO NONE
The important section of features is MINIMIZE section. It enables OMITTAG which is a standard feature of HTML, which allows start or end tags to be ommited. This is particular allows you to write code like <p> a <p> b, without closing paragraphs.
The more important part is SHORTTAG feature, which is actually a category. However, because it's not expanded, the SGML automatically assumed YES for all entries in it. It has the following categories in it. Feel free to skip this list, if you aren't interested in other shorthand features in SGML.
ATTRIB, which deals with attributes, and has following options.
DEFAULT - defines whether attributes can contain default values. This allows writing <p> without defining every single attribute. Nobody would want to write <p id="" class="" style="" title="" lang="en" dir="ltr" onclick="" ondblclick="" ...></p> after all. Hey, I even gave up trying to write all that. This is a commonly supported feature.
OMITNAME - if the attribute and value have the same name, the value is optional. This allows writing <input type="checkbox" checked> for instance. This is a commonly supported feature (although, HTML5 defines default to be empty string, not an attribute name).
VALUE - allows writing values without quotes. This allows writing code like <p class=warning></p> for instance. This is a commonly supported feature.
ENDTAG, which is a category for end tags containing the following options.
UNCLOSED - allows starting a new tag before ending the previous tag, allowing code like <p><b></b</p>.
EMPTY - allows unnamed end tags, such as <b>something</>. They close most recent element which is still open.
STARTTAG, which is a category for start tags containing the following options.
NETENABL - allows using Null End Tag notation. It's worth noting this notation is incompatible with XHTML. Anyway, the feature allows writing code like <b/<i/hello//, which means the same thing as <b><i>hello</i></b>.
UNCLOSED - allows starting a new tag before ending the previous tag, allowing code like <p<b></b></p>.
EMPTY - this is the asked feature.
Now, it's important to understand what EMPTY does. While <> may appear useless at first (hey, how could you determine what it does, when nothing aside of Validator supports it), it's actually not. It opens the previous sibling, allowing code like the following.
<ul>
<li class=plus> hello world
<> another list element
<> yet another
<li class=minus> nope
<> what am I doing?
</ul>
In this example, the list has two classes, plus and minus for positive and negative arguments. However, the webmaster was lazy (and doesn't care about that HTML4 doesn't support this), and decided to use empty start tag in order to not specify the class for next elements. Because <li> has optional end tag, this automatically closed previous <li> tag.

xPath expression for attributes that don't have ancestors with same attribute

I'm trying to extract elements with an attribute, and not extract the descendants separately that have the same attribute.
Using the following html:
<html><body>
<div box>
some text
<div box>
some more text
</div>
</div>
<div box>
this needs to be included as well
</div>
</body></html>
I want to be able to extract the two outer <div box> and its descendants including the inner <div box>, but don't want to have the inner <div box> extracted separately.
I have tried using all sorts of different expressions but think I am missing something quite fundamental. The main expression I have been trying is: //[#box and not(ancestor::#box) but this still returns two elements.
I am trying to do this using the 'Hpricot' (0.8.3) Gem in Ruby 1.9.2 as follows:
# Assuming html is set to the html above
doc = Hpricot(html)
elements = doc.search('//[#box and not(ancestor::#box)]')
# The following is returning 3 instead of 2
elements.size
Any help on this would be great.
Your XPATH is invalid. You have to address something in order to use the predicate filter(e.g. []). Otherwise, there isn't anything to filter.
This XPATH works:
//div[#box and not(ancestor::div/#box)]
If the elements aren't all guarenteed to be <div>, you can use a more generic match for elements:
//*[#box and not(ancestor::*/#box)]
Using elements = doc.search('//[#box and not(ancestor::#box)]') isn't correct.
Use elements = doc.at('//div[#box]') which will find the first occurrence.
I'd recommend using Nokogiri over Hpricot. Nokogiri is well supported, very flexible and more robust.
EDIT: Added because original question changed:
Thanks that worked perfectly, except I forget to mention that I want to return multiple outer elements. Sorry about that, I have updated the question. I will look into Nokogiri further, I didn't choose it originally because Hpricot seemed more approachable.
Remember that XPath acts like accessing a file in a directory at its simplest form, so you can drill down and search in "subdirectories". If you only want the outer <div> tags, then look inside the <body> level and no further:
doc.search('/html/body/div')
or, if you might have unadorned div tags along with the targets:
doc.search('/html/body/div[#box]')
Regarding Hpricot seeming more approachable:
Nokogiri implements a superset of Hpricot's accessors, allowing you to drop it into place for most uses. It supports XPath and CSS accessors allowing more intuitive ways of getting at data if you live in CSS and HTML and don't grok XPath. In addition there are many methods to find your desired target:
doc.search('body > div[box]')
(doc / 'body > div[box]')
doc.css('body > div[box]')
Nokogiri supports the at and % synonym found in Hpricot also, along with css_at, if you only want the first occurrence of something.
I started using Nokogiri after running into some situations where Hpricot exploded because it couldn't handle malformed news-feeds in the wilds.

Resources