Xpath: All nodes until a node ( Wikiquote.org ) - ruby

DOCUMENT: http://en.wikiquote.org/wiki/The_Matrix
I want to get all quotes (//ul/li) of the first section (Neo's quotes).
I cannot use //ul[1]/li because on some Wikiquote pages each quote sits in its own list, like this:
<h2><span class="mw-headline" id="Neo">Neo</span></h2>
<ul>
<li> First quote </li>
</ul>
<ul>
<li> Second quote </li>
</ul>
<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
Instead of
<ul>
<li> First quote </li>
<li> Second quote </li>
</ul>
I've tried this to get the first section:
(//*[@id='mw-content-text']/ul/preceding-sibling::h2/span[@class='mw-headline'])[1]
but I am having trouble getting only the quotes of the first section. Can you help me?

Use:
(//h2[span/@id='Neo'])[1]/following-sibling::ul
[count(.
|
(//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
This selects all li children of the ul elements that lie between the first h2 having a span child whose id attribute is "Neo" and the next h2 sibling.
To select the quotations for the second such h2, replace the [1] that follows (//h2[span/@id='Neo']) with [2] in all three places.
Do this for all numbers: 1, 2, ..., count(//h2[span/@id='Neo'])
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"(//h2[span/@id='Neo'])[1]/following-sibling::ul
[count(.
|
(//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<html>
<h2><span class="mw-headline" id="Neo">Neo</span></h2>
<ul>
<li> First quote </li>
</ul>
<ul>
<li> Second quote </li>
</ul>
<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
</html>
the XPath expression is evaluated, and the selected nodes are copied to the output:
<li> First quote </li>
<li> Second quote </li>
Explanation:
This follows from the Kayessian (by Dr. Michael Kay) formula for intersection of two node-sets:
$ns1[count(.|$ns2) = count($ns2)]
The above selects exactly the nodes that belong to both the node-set $ns1 and the node-set $ns2.
So, we substitute $ns1 with the nodeset consisting of all following siblings ul of the h2 of interest. We substitute $ns2 with the nodeset consisting of all preceding siblings ul of the h2 that is the immediate (1st) following sibling of the h2 of interest.
The intersection of these two nodesets contains exactly all ul elements that are wanted.
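The counting trick is easier to see outside XPath. The following is a minimal Ruby sketch, not part of the original answer, that models node-sets as arrays of unique symbols. The method name `kayessian_intersection` and the sibling names are hypothetical, chosen only to mirror the sample document with one extra list added after the second heading.

```ruby
# Model of $ns1[count(.|$ns2) = count($ns2)] using arrays as node-sets.
# Assumption: each array holds unique items, as an XPath node-set would.
def kayessian_intersection(ns1, ns2)
  ns1.select do |node|
    # Array#| is a set union: adding a node that is already in ns2
    # leaves the size unchanged, so the node belongs to both sets.
    (ns2 | [node]).size == ns2.size
  end
end

# ul siblings that follow the "Neo" h2 (hypothetical names).
following_uls = [:ul_first, :ul_second, :ul_after_useless]
# ul siblings that precede the next h2.
preceding_uls = [:ul_first, :ul_second]

p kayessian_intersection(following_uls, preceding_uls)
# prints [:ul_first, :ul_second]
```

Only the lists that sit between the two headings survive, which is what the XPath predicate achieves with count().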
Update: In a comment the OP states that he only knows that he wants the results to be from the first section -- the string "Neo" isn't known.
Here is the modified solution:
(//h2[span/@id=$vSectionId])[1]
/following-sibling::ul
[count(.
|
(//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
The variable $vSectionId must be obtained as the string value of the following XPath expression:
substring(//div[h2='Contents']
/following-sibling::ul[1]
/li[1]/a/@href,
2)
Here we are getting the wanted id from the href of the a in the first Table Of Contents entry, and skipping the first character "#".
Here is again an XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:variable name="vSectionId" select=
"substring(//div[h2='Contents']
/following-sibling::ul[1]
/li[1]/a/@href,
2)
"/>
<xsl:template match="/">
<xsl:copy-of select=
"(//h2[span/@id=$vSectionId])[1]
/following-sibling::ul
[count(.
|
(//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the complete XML document that is at:
http://en.wikiquote.org/wiki/The_Matrix, the result of applying these two XPath expressions (substituting the result of the first in the second, then evaluating the second expression) is the wanted, correct one:
<li>I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.</li>
<li>Whoa.</li>
<li>I know kung-fu.</li>
<li>Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.</li>
<li>Guns.. lots of guns...</li>
<li>There is no spoon.</li>
<li>My name...is Neo!</li>

Using the API will make it MUCH easier to parse. Here's a query that will pull the first section:
http://en.wikiquote.org/w/api.php?action=parse&page=The_Matrix&section=1&prop=wikitext
Output:
<?xml version="1.0"?>
<api>
<parse title="The Matrix">
<wikitext xml:space="preserve">== Neo ==
[[File:The.Matrix.glmatrix.2.png|thumb|right|Unfortunately, no one can be ''told'' what The Matrix is. You have to see it for yourself.]]
[[Image:Arty spoon.jpg|thumb|right|Do not try to bend the spoon — that's impossible. Instead, only try to realize the truth: there is no spoon.]]
* I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.
* Whoa.
* I know kung-fu.
* Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.
* Guns.. lots of guns...
* There is no spoon.
* My name...is Neo!</wikitext>
</parse>
</api>
Here's one way to parse this (using HTTParty):
require 'httparty'

class Wikiquote
  include HTTParty
  base_uri 'en.wikiquote.org/w/'

  # Fetch section 1 of the page as wikitext and return its bullet items.
  def self.get_quotes(page)
    url = "/api.php?action=parse&page=#{page}&section=1&prop=wikitext&format=xml"
    headers = {"User-Agent" => "Wikiquote scraper 1.0"}
    content = get(url, headers: headers)['api']['parse']['wikitext']['__content__']
    # Each quote is a line that starts with "* " in the wikitext.
    content.scan(/^\* (.*)$/).flatten
  end
end
Usage:
Wikiquote.get_quotes("The_Matrix")
Output:
["I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.",
"Whoa.",
"I know kung-fu.",
"Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.",
"Guns.. lots of guns...",
"There is no spoon. ",
"My name...is Neo!"]

I suggest //ul[preceding-sibling::h2[1][span/@id = 'Neo']]/li. Or, if the id attribute is not present or not relevant for the search, then based on the answer in a comment I think you want
(//h2[span[contains(@class, 'mw-headline')]])[1]/following-sibling::ul
[1 = count(preceding-sibling::h2[1] | (//h2[span[contains(@class, 'mw-headline')]])[1])]/li
See "XPath axis, get all following nodes until" for an explanation. I hope I have managed to close all brackets and braces correctly; I don't have time to test right now.
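When the XPath becomes unwieldy, the same "siblings until the next h2" selection can be done by iterating in Ruby. This is a minimal sketch rather than anything from the answers above. It assumes well-formed markup and uses REXML so that it runs without extra gems (a live Wikiquote page would more likely need an HTML parser such as Nokogiri). The helper name `section_quotes` and the sample ids are hypothetical.

```ruby
require 'rexml/document'

# Hypothetical helper: text of every <li> in the <ul> siblings that follow the
# h2 whose span has the given id, stopping at the next h2.
def section_quotes(doc, section_id)
  heading = REXML::XPath.match(doc, '//h2').find do |h2|
    h2.elements.to_a('span').any? { |span| span.attributes['id'] == section_id }
  end
  return [] unless heading

  siblings = heading.parent.elements.to_a            # element children, document order
  start = siblings.index { |el| el.equal?(heading) } # position of our heading
  siblings.drop(start + 1)
          .take_while { |el| el.name != 'h2' }       # stop at the next section
          .select { |el| el.name == 'ul' }
          .flat_map { |ul| ul.elements.to_a('li') }
          .map { |li| li.text.to_s.strip }           # first text node of each li only
end

html = <<~HTML
  <html>
  <h2><span class="mw-headline" id="Neo">Neo</span></h2>
  <ul><li> First quote </li></ul>
  <ul><li> Second quote </li></ul>
  <h2><span class="mw-headline" id="Useless">Useless</span></h2>
  <ul><li> Unwanted quote </li></ul>
  </html>
HTML

p section_quotes(REXML::Document.new(html), 'Neo')
# prints ["First quote", "Second quote"]
```

The take_while step plays the role of the count() predicate: it keeps following siblings only until another h2 appears.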

Related

Split methods on XPath 1.0

I use XPath. How can I simulate a split method?
I read the documentation and I know that XPath 1.0 does not have this method.
I have a document that contains these tags:
<TestCategoryModule>
<ItemCategories>
<![CDATA[Birthday Travel,Travel]]>
</ItemCategories>
</TestCategoryModule>
<TestCategoryModule2>
<ItemCategories>
<![CDATA[Travel]]>
</ItemCategories>
</TestCategoryModule2>
I want to filter items by 'ItemCategories', but when I filter by the word 'Travel', 2 items are returned. I use this filter: "ItemCategories[contains(text(), 'Travel')]".
I want filtering by "Travel" to return only the second item. How can I do it?
Use:
/*/*/*[contains(concat(',', ., ','), ',Travel,')]
Here is XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/*/*[contains(concat(',', ., ','), ',Travel,')]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on this XML document (essentially the provided XML fragment, extended with one more test case and made into a well-formed XML document):
<t>
<TestCategoryModule>
<ItemCategories>Birthday Travel,Travel</ItemCategories>
</TestCategoryModule>
<TestCategoryModule2>
<ItemCategories>Birthday Travel</ItemCategories>
</TestCategoryModule2>
<TestCategoryModule2>
<ItemCategories>Travel</ItemCategories>
</TestCategoryModule2>
</t>
The wanted, correct result is produced:
<ItemCategories>Birthday Travel,Travel</ItemCategories>
<ItemCategories>Travel</ItemCategories>
I was a little wrong, or described the problem poorly. The problem is that the categories are stored as a string. I have three items: the first contains the categories (Birthday Travel,Travel), the second (Birthday Travel), the third (Travel). When I filter for the word "Travel", I need to get the first and third items, but I get all three, because all items contain the word "Travel".
You actually don't need split() for the problem that you've described. If you want to match Travel but not Travel,Travel you want = instead of contains(). To deal with the whitespace around your CDATA sections, wrap it in normalize-space().
All put together, try ItemCategories[normalize-space(text()) = 'Travel'].
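The delimiter-wrapping trick from the first answer may be easier to follow as plain string handling. Below is a small Ruby sketch of the same idea. The method name `category?` is hypothetical, and `strip` is only a rough stand-in for normalize-space(), since it trims the ends without collapsing inner whitespace.

```ruby
# Mirrors contains(concat(',', ., ','), ',Travel,'): wrap the list in
# delimiters so that only a whole comma-separated entry can match.
def category?(list, wanted)
  # Trim surrounding whitespace first, as the CDATA sections carry newlines.
  ",#{list.strip},".include?(",#{wanted},")
end

items = ["Birthday Travel,Travel", "Birthday Travel", "\nTravel\n"]
matches = items.select { |categories| category?(categories, "Travel") }
puts matches.size
# prints 2 (the first and third items)
```

Because the search value is wrapped in commas too, "Birthday Travel" no longer matches a search for "Travel".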

xpath with multiple predicates equivalence

I was told that the following are not the same:
a[1][@attr="foo"]
a[@attr="foo"][1]
Can someone explain why that is the case?
Think of XPath expressions as defining a result set [1] - a set of nodes that fulfil all the requirements stated in the XPath expression. The predicates of XPath expressions (the parts inside []) either have no effect on the result set or they incrementally narrow it.
Put another way, in the following expression:
//xyz[@abc="yes"]
[@abc="yes"] reduces the result set defined to the left of it, by //xyz.
Note that, as Michael Kay has suggested, all that is said below only applies to XPath expressions with at least one positional predicate. Positional predicates are either a number: [1] or evaluate to a number, or contain position() or last().
If no positional predicate is present, the order of predicates in XPath expressions is not significant.
Consider the following simple input document:
<root>
<a attr="other"/>
<a attr="foo"/>
<a attr="other"/>
<a attr="foo"/>
</root>
As you can see, a[@attr = 'foo'] is not the first child element of root. If we apply
//a[1]
to this document, this will of course result in
<a attr="other"/>
Now, crucially, if we add another predicate to the expression, like so:
//a[1][@attr="foo"]
Then, [@attr="foo"] can only influence the result set defined by //a[1] already. In this result set, there is no a[@attr="foo"] - and the final result is empty.
On the other hand, if we start out with
//a[@attr="foo"]
the result will be
<a attr="foo"/>
-----------------------
<a attr="foo"/>
and in this case, if we add a second predicate:
//a[@attr="foo"][1]
the second predicate [1] can narrow down the result set of //a[@attr="foo"] to only contain the first of those nodes.
If you know XSLT, you might find an XSLT (and XPath 2.0) proof of this helpful:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
<xsl:template match="/">
<result1>
<xsl:copy-of select="//a[1][@attr='foo']"/>
</result1>
<result2>
<xsl:copy-of select="//a[@attr='foo'][1]"/>
</result2>
</xsl:template>
</xsl:transform>
And the result will be
<result1/>
<result2>
<a attr="foo"/>
</result2>
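The effect of predicate order can also be mimicked with ordinary array operations. This Ruby sketch is only an analogy, not how an XPath engine is implemented; the hashes stand in for the four a elements of the sample document, and both method names are hypothetical.

```ruby
# The four <a> elements of the sample document, in document order.
SAMPLE_NODES = [
  { pos: 1, attr: 'other' },
  { pos: 2, attr: 'foo' },
  { pos: 3, attr: 'other' },
  { pos: 4, attr: 'foo' }
].freeze

# a[1][@attr="foo"]: take the first node, then test its attribute.
def first_then_filter(nodes)
  nodes.first(1).select { |n| n[:attr] == 'foo' }
end

# a[@attr="foo"][1]: keep the matching nodes, then take the first of them.
def filter_then_first(nodes)
  nodes.select { |n| n[:attr] == 'foo' }.first(1)
end

p first_then_filter(SAMPLE_NODES).map { |n| n[:pos] }
# prints []
p filter_then_first(SAMPLE_NODES).map { |n| n[:pos] }
# prints [2]
```

Each predicate works on whatever the previous step left over, which is why swapping a positional predicate with another one changes the result.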
[1] Technically speaking, only XPath 1.0 calls this result a node-set. In XPath 2.0, all sets have become true sequences of nodes.

Select adjacent sibling elements without intervening non-whitespace text nodes

Given markup like:
<p>
<code>foo</code><code>bar</code>
<code>jim</code> and then <code>jam</code>
</p>
I need to select the first three <code>, but not the last. The logic is "Select all code elements that have a preceding-or-following-sibling-element that is also a code, unless there exist one or more text nodes with non-whitespace content between them."
Given that I am using Nokogiri (which uses libxml2) I can only use XPath 1.0 expressions.
Although a tricky XPath expression is desired, Ruby code/iterations to perform the same on a Nokogiri document are also acceptable.
Note that the CSS adjacent sibling selector ignores non-element nodes, and so selecting nokodoc.css('code + code') will incorrectly select the last <code> block.
Nokogiri.XML('<r><a/><b/> and <c/></r>').css('* + *').map(&:name)
#=> ["b", "c"]
Edit: More test cases, for clarity:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
All the Y above should be selected. None of the N should be selected. The content of the <code> are used only to indicate which should be selected: you may not use the content to determine whether or not to select an element.
The context elements in which the <code> appear are irrelevant. They may appear in <li>, they may appear in <p>, they may appear in something else.
I want to select all the consecutive runs of <code> at once. It is not a mistake that there is a space character in the middle of one of the sets of Y.
Use:
//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
the contained XPath expression is evaluated and the selected nodes are copied to the output:
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
//code[
(
following-sibling::node()[1][self::code]
or (
following-sibling::node()[1][self::text() and normalize-space() = ""]
and
following-sibling::node()[2][self::code]
)
)
or (
preceding-sibling::node()[1][self::code]
or (
preceding-sibling::node()[1][self::text() and normalize-space() = ""]
and
preceding-sibling::node()[2][self::code]
)
)
]
I think this does what you want, though I won’t claim you’d actually want to use it.
I’m assuming text nodes are always merged together so that there won’t be two adjacent to each other, which I believe is generally the case, but might not be if you’re doing DOM manipulations beforehand. I’ve also assumed that there won’t be any other elements between code elements, or that if there are they prevent selection like non-whitespace text.
I think this is what you want:
/p/code[not(preceding-sibling::text()[not(normalize-space(.)="")])]
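Since the question also accepts plain Ruby iteration, here is a minimal sketch of that route. It assumes REXML in place of Nokogiri so that it runs without extra gems, and the helper names `code?` and `adjacent_codes` are hypothetical. Whitespace-only text is dropped first, then consecutive code siblings are grouped into runs and only runs of two or more are kept.

```ruby
require 'rexml/document'

# True for <code> elements only.
def code?(node)
  node.is_a?(REXML::Element) && node.name == 'code'
end

# Returns every <code> that has an adjacent <code> sibling, where only
# whitespace-only text may sit between the two.
def adjacent_codes(doc)
  REXML::XPath.match(doc, '//*').flat_map do |parent|
    runs = parent.children
                 .reject { |n| n.is_a?(REXML::Text) && n.value.strip.empty? }
                 .chunk_while { |a, b| code?(a) && code?(b) } # group neighbouring codes
                 .select { |run| run.size > 1 }               # lone codes are dropped
    runs.flatten(1)
  end
end

xml = <<~XML
  <section>
  <p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
  <p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
  </section>
XML

p adjacent_codes(REXML::Document.new(xml)).map(&:text)
# prints ["Y", "Y", "Y"]
```

Any other node between two code elements (non-whitespace text, another element such as br) ends the run, matching the rule stated in the question.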

how do I formulate this xpath expression?

given the following div element
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
I want to retrieve contents of the span with class "b". However, some divs I want to parse lack the second two spans (of class "b" and "c"). For these divs, I want the contents of the span with class "a". Is it possible to create a single XPath expression that selects this?
If it is not possible, is it possible to create a selector that retrieves the entire contents of the div? ie retrieves
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
If I can do that, I can use a regex to find the data I want. (I can select the text within the div, but I'm not sure how to select the tags also. Just the text yields 123456789.)
More efficient -- requires no union:
//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]
An expression (like the one below) that is the union of two absolute "//" expressions typically performs two complete document tree traversals, and then the union operation does deduplication and sorting in document order. All this can be significantly less efficient than a single tree traversal, unless the XPath processor has an intelligent optimizer.
An example of such an inefficient expression:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
The XPath expression is evaluated and the selected elements (in this case just one) are copied to the output:
<span class="b">456</span>
When the same transformation is applied on a different XML document, where there is no class='b':
<div class="info">
title
<span class="a">123</span>
<span class="x">456</span>
<span class="c">789</span>
</div>
the same XPath expression is evaluated and the correctly selected element is copied to the output:
<span class="a">123</span>
The xpath expression should be something like:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
The expression to the left of the union operator | selects all the b-class spans inside all divs. The expression on the right-hand side first finds all divs that do not have a b-class span and then selects their a-class span. The | operator combines the results of the two sets.
See here for selecting nodes with not() and here for combining results with the | operator.
Also, to refer to the second part of your question have a look here.
Using node() in your xpath you can select everything (nodes + text) that is below the node selected. So you can get everything in the div returned by
//div/node()
for future processing by other means.
An expression that works on your input without the union operator:
//div/span[@class='a' or @class='b'][count(../span[@class='b']) + 1]
This is just for fun. I'd probably use something more like @inVader's answer in production code.
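If the XPath is being run from Ruby anyway, the fallback can also be expressed in the host language with two simple lookups. This is a minimal sketch, assuming REXML and a well-formed div; the helper name `preferred_span_text` is hypothetical.

```ruby
require 'rexml/document'

# Returns the text of the class "b" span, or of the class "a" span when the
# div has no "b" span. Returns nil when neither is present.
def preferred_span_text(div)
  span = div.elements["span[@class='b']"] || div.elements["span[@class='a']"]
  span&.text
end

with_b = REXML::Document.new(
  '<div class="info">title<span class="a">123</span><span class="b">456</span></div>'
)
without_b = REXML::Document.new(
  '<div class="info">title<span class="a">123</span></div>'
)

puts preferred_span_text(with_b.root)
# prints 456
puts preferred_span_text(without_b.root)
# prints 123
```

This trades the single-expression requirement for readability, so it only applies when the calling code is under your control.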

XPath - Get node with no child of specific type

XML: /A/B or /A
I want to get all A nodes that do not have any B children.
I've tried
/A[not(B)]
/A[not(exists(B))]
without success
I prefer a solution with the syntax /*[local-name()="A" and .... ], if possible. Any ideas that work?
Clarification. The xml looks like:
<WhatEver>
<A>
<B></B>
</A>
</WhatEver>
or
<WhatEver>
<A></A>
</WhatEver>
Maybe
*[local-name() = 'A' and not(descendant::*[local-name() = 'B'])]?
Also, there should be only one root element, so for /A[...] you're either getting all your XML back or none. Maybe //A[not(B)] or /*/A[not(B)]?
I don't really understand why /A[not(B)] doesn't work for you.
~/xml% xmllint ab.xml
<?xml version="1.0"?>
<root>
<A id="1">
<B/>
</A>
<A id="2">
</A>
<A id="3">
<B/>
<B/>
</A>
<A id="4"/>
</root>
~/xml% xpath ab.xml '/root/A[not(B)]'
Found 2 nodes:
-- NODE --
<A id="2">
</A>
-- NODE --
<A id="4" />
Try this "/A[not(.//B)]" or this "/A[not(./B)]".
The first / causes XPath to start at the root of the document, I doubt that is what you intended.
Perhaps you meant //A[not(B)] which would find all A nodes in the document at any level that do not have a direct B child.
Or perhaps you are already at a node that contains A nodes in which case you just want A[not(B)] as the XPath.
If you are trying to get A anywhere in the hierarchy from the root, this works (for XSLT 1.0 as well as 2.0, in case it's used in XSLT):
//descendant-or-self::node()[local-name(.) = 'a' and not(count(b))]
OR you can also do
//descendant-or-self::node()[local-name(.) = 'a' and not(b)]
OR also
//descendant-or-self::node()[local-name(.) = 'a' and not(child::b)]
There are any number of ways in XSLT to achieve the same thing.
Note: XPath is case-sensitive, so if your node names are different (which I am sure they are, since no one is going to use A and B), please make sure the case matches.
Use this:
/*[local-name()='A' and not(descendant::*[local-name()='B'])]

Resources