Select adjacent sibling elements without intervening non-whitespace text nodes - ruby

Given markup like:
<p>
<code>foo</code><code>bar</code>
<code>jim</code> and then <code>jam</code>
</p>
I need to select the first three <code> elements, but not the last. The logic is: "Select all code elements that have a preceding or following sibling element that is also a code, unless there are one or more text nodes with non-whitespace content between them."
Given that I am using Nokogiri (which uses libxml2) I can only use XPath 1.0 expressions.
Although a tricky XPath expression is desired, Ruby code/iterations to perform the same on a Nokogiri document are also acceptable.
Note that the CSS adjacent sibling selector ignores non-element nodes, and so selecting nokodoc.css('code + code') will incorrectly select the last <code> block.
Nokogiri.XML('<r><a/><b/> and <c/></r>').css('* + *').map(&:name)
#=> ["b", "c"]
Edit: More test cases, for clarity:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
All the Y above should be selected. None of the N should be selected. The contents of the <code> elements are used only to indicate which should be selected: you may not use the content to determine whether or not to select an element.
The context elements in which the <code> appear are irrelevant. They may appear in <li>, they may appear in <p>, they may appear in something else.
I want to select all the consecutive runs of <code> at once. It is not a mistake that there is a space character in the middle of one of the sets of Y.

Use:
//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
the contained XPath expression is evaluated and the selected nodes are copied to the output:
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
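Since the question also accepts an iterative solution, here is a minimal sketch of the same rule outside XPath. It is written in Python with the standard-library ElementTree rather than Nokogiri, and the function name adjacent_runs is made up for illustration. It assumes the default parser, which discards comments, so unlike the XPath above a comment between two elements does not break a run. Python's strip() also treats a few more characters as whitespace than XPath's normalize-space() does.

```python
import xml.etree.ElementTree as ET


def adjacent_runs(root, tag="code"):
    """Return `tag` elements that sit next to another `tag` sibling with
    nothing but whitespace text between the two."""
    selected = []
    for parent in root.iter():
        children = list(parent)
        picked = set()
        for i in range(len(children) - 1):
            left, right = children[i], children[i + 1]
            # ElementTree stores the text that follows an element in .tail
            gap = left.tail or ""
            if left.tag == tag and right.tag == tag and not gap.strip():
                picked.update((i, i + 1))
        selected.extend(children[i] for i in sorted(picked))
    return selected


doc = ET.fromstring(
    "<p><code>foo</code><code>bar</code>\n"
    "<code>jim</code> and then <code>jam</code></p>"
)
print([el.text for el in adjacent_runs(doc)])  # ['foo', 'bar', 'jim']
```

Because any element between two code elements becomes its own entry in the child list, an intervening <br/> interrupts a run just as it does in the XPath expression.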

//code[
(
following-sibling::node()[1][self::code]
or (
following-sibling::node()[1][self::text() and normalize-space() = ""]
and
following-sibling::node()[2][self::code]
)
)
or (
preceding-sibling::node()[1][self::code]
or (
preceding-sibling::node()[1][self::text() and normalize-space() = ""]
and
preceding-sibling::node()[2][self::code]
)
)
]
I think this does what you want, though I won’t claim you’d actually want to use it.
I’m assuming text nodes are always merged together so that there won’t be two adjacent to each other, which I believe is generally the case, but might not be if you’re doing DOM manipulations beforehand. I’ve also assumed that there won’t be any other elements between code elements, or that if there are they prevent selection like non-whitespace text.

I think this is what you want:
/p/code[not(preceding-sibling::text()[not(normalize-space(.)="")])]

Related

xpath with multiple predicates equivalence

I was told that the following are not the same:
a[1][@attr="foo"]
a[@attr="foo"][1]
Can someone explain why that is the case?
Think of XPath expressions as defining a result set¹ - a set of nodes that fulfil all the requirements stated in the XPath expression. The predicates of XPath expressions (the parts inside []) either have no effect on the result set or they incrementally narrow it.
Put another way, in the following expression:
//xyz[@abc="yes"]
[@abc="yes"] reduces the result set defined to the left of it, by //xyz.
Note that, as Michael Kay has suggested, all that is said below only applies to XPath expressions with at least one positional predicate. Positional predicates are either a number: [1] or evaluate to a number, or contain position() or last().
If no positional predicate is present, the order of predicates in XPath expressions is not significant.
Consider the following simple input document:
<root>
<a attr="other"/>
<a attr="foo"/>
<a attr="other"/>
<a attr="foo"/>
</root>
As you can see, a[@attr = 'foo'] is not the first child element of root. If we apply
//a[1]
to this document, this will of course result in
<a attr="other"/>
Now, crucially, if we add another predicate to the expression, like so:
//a[1][@attr="foo"]
Then, [@attr="foo"] can only influence the result set defined by //a[1] already. In this result set, there is no a[@attr="foo"] - and the final result is empty.
On the other hand, if we start out with
//a[@attr="foo"]
the result will be
<a attr="foo"/>
-----------------------
<a attr="foo"/>
and in this case, if we add a second predicate:
//a[@attr="foo"][1]
the second predicate [1] can narrow down the result set of //a[@attr="foo"] to only contain the first of those nodes.
If you know XSLT, you might find an XSLT (and XPath 2.0) proof of this helpful:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
<xsl:template match="/">
<result1>
<xsl:copy-of select="//a[1][@attr='foo']"/>
</result1>
<result2>
<xsl:copy-of select="//a[@attr='foo'][1]"/>
</result2>
</xsl:template>
</xsl:transform>
And the result will be
<result1/>
<result2>
<a attr="foo"/>
</result2>
¹ Technically speaking, only XPath 1.0 calls this result a node-set. In XPath 2.0, all sets have become true sequences of nodes.
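The narrowing described above can be imitated with plain list operations. This is only an analogy in Python, not an XPath engine: the helper names first and attr_is_foo are invented for illustration, and the list all_a stands in for //a, which is only accurate here because every a element shares one parent.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<root><a attr="other"/><a attr="foo"/>'
    '<a attr="other"/><a attr="foo"/></root>'
)
all_a = root.findall("a")  # plays the role of //a


def first(nodes):
    """Positional predicate [1]: keep only the first node of the current set."""
    return nodes[:1]


def attr_is_foo(nodes):
    """Predicate [@attr="foo"]: keep nodes whose attr equals "foo"."""
    return [n for n in nodes if n.get("attr") == "foo"]


first_then_filter = attr_is_foo(first(all_a))  # //a[1][@attr="foo"]
filter_then_first = first(attr_is_foo(all_a))  # //a[@attr="foo"][1]

print(len(first_then_filter))       # 0
print(filter_then_first[0].attrib)  # {'attr': 'foo'}
```

Each function receives whatever the previous step left over, which is why swapping the two steps changes the result once one of them is positional.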

Xpath: All nodes until a node ( Wikiquote.org )

DOCUMENT: http://en.wikiquote.org/wiki/The_Matrix
I want to get all quotes (//ul/li) of the first section (Neo's quotes).
I cannot use //ul[1]/li because on some Wikiquote pages the quotes are represented in this form
<h2><span class="mw-headline" id="Neo">Neo</span></h2>
<ul>
<li> First quote </li>
</ul>
<ul>
<li> Second quote </li>
</ul>
<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
Instead of
<ul>
<li> First quote </li>
<li> Second quote </li>
</ul>
I've tried this to get the first section
(//*[@id='mw-content-text']/ul/preceding-sibling::h2/span[@class='mw-headline'])[1]
but I am having trouble getting only the quotes of the first section. Can you help me?
Use:
(//h2[span/@id='Neo'])[1]/following-sibling::ul
[count(.
|
(//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
This selects the li children of every ul sibling that lies between the first h2 having a span child with id "Neo" and the next h2.
To select the quotations for the second such h2, simply replace 1 with 2 in the above expression.
Do this for all numbers: 1, 2, ..., count(//h2[span/@id='Neo'])
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"(//h2[span/@id='Neo'])[1]/following-sibling::ul
[count(.
|
(//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id='Neo'])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<html>
<h2><span class="mw-headline" id="Neo">Neo</span></h2>
<ul>
<li> First quote </li>
</ul>
<ul>
<li> Second quote </li>
</ul>
<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
</html>
the XPath expression is evaluated, and the selected nodes are copied to the output:
<li> First quote </li>
<li> Second quote </li>
Explanation:
This follows from the Kayessian (by Dr. Michael Kay) formula for intersection of two node-sets:
$ns1[count(.|$ns2) = count($ns2)]
The above selects exactly the nodes that belong to both the node-set $ns1 and the node-set $ns2.
So, we substitute $ns1 with the nodeset consisting of all following siblings ul of the h2 of interest. We substitute $ns2 with the nodeset consisting of all preceding siblings ul of the h2 that is the immediate (1st) following sibling of the h2 of interest.
The intersection of these two nodesets contains exactly all ul elements that are wanted.
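For comparison, the same "siblings between this h2 and the next h2" selection can be written procedurally. The sketch below uses Python's standard-library ElementTree and a hypothetical helper section_items. It assumes the h2 and ul elements are direct children of one container element. It also differs from the XPath in one case: when no later h2 exists, this version takes every remaining ul, whereas the intersection above would select nothing.

```python
import xml.etree.ElementTree as ET


def section_items(container, section_id):
    """Collect li elements from the ul siblings that lie between the h2
    whose span has id == section_id and the next h2."""
    items, inside = [], False
    for child in container:
        if child.tag == "h2":
            if inside:
                break  # the next heading ends the wanted section
            inside = any(s.get("id") == section_id for s in child.findall("span"))
        elif inside and child.tag == "ul":
            items.extend(child.findall("li"))
    return items


page = ET.fromstring(
    '<html><h2><span class="mw-headline" id="Neo">Neo</span></h2>'
    "<ul><li>First quote</li></ul><ul><li>Second quote</li></ul>"
    '<h2><span class="mw-headline" id="other">Useless</span></h2>'
    "<ul><li>Unwanted</li></ul></html>"
)
print([li.text for li in section_items(page, "Neo")])
# ['First quote', 'Second quote']
```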
Update: In a comment the OP states that he only knows that he wants the results to be from the first section -- the string "Neo" isn't known.
Here is the modified solution:
(//h2[span/@id=$vSectionId])[1]
/following-sibling::ul
[count(.
|
(//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
The variable $vSectionId must be obtained as the string value of the following XPath expression:
substring(//div[h2='Contents']
/following-sibling::ul[1]
/li[1]/a/@href,
2)
Here we are getting the wanted id from the href of the a in the first Table Of Contents entry, and skipping the first character "#".
Here is again an XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:variable name="vSectionId" select=
"substring(//div[h2='Contents']
/following-sibling::ul[1]
/li[1]/a/@href,
2)
"/>
<xsl:template match="/">
<xsl:copy-of select=
"(//h2[span/@id=$vSectionId])[1]
/following-sibling::ul
[count(.
|
(//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
=
count((//h2[span/@id=$vSectionId])[1]
/following-sibling::h2[1]
/preceding-sibling::ul
)
]
/li
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the complete XML document that is at:
http://en.wikiquote.org/wiki/The_Matrix, the result of applying these two XPath expressions (substituting the result of the first in the second, then evaluating the second expression) is the wanted, correct one:
<li>I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.</li>
<li>Whoa.</li>
<li>I know kung-fu.</li>
<li>Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.</li>
<li>Guns.. lots of guns...</li>
<li>There is no spoon.</li>
<li>My name...is Neo!</li>
Using the API will make it MUCH easier to parse. Here's a query that will pull the first section:
http://en.wikiquote.org/w/api.php?action=parse&page=The_Matrix&section=1&prop=wikitext
Output:
<?xml version="1.0"?>
<api>
<parse title="The Matrix">
<wikitext xml:space="preserve">== Neo ==
[[File:The.Matrix.glmatrix.2.png|thumb|right|Unfortunately, no one can be ''told'' what The Matrix is. You have to see it for yourself.]]
[[Image:Arty spoon.jpg|thumb|right|Do not try to bend the spoon — that's impossible. Instead, only try to realize the truth: there is no spoon.]]
* I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.
* Whoa.
* I know kung-fu.
* Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.
* Guns.. lots of guns...
* There is no spoon.
* My name...is Neo!</wikitext>
</parse>
</api>
Here's one way to parse this (using HTTParty):
require 'httparty'

class Wikiquote
  include HTTParty
  base_uri 'en.wikiquote.org/w/'

  # Fetch section 1 of the page as wikitext and return its bulleted lines.
  def self.get_quotes(page)
    url = "/api.php?action=parse&page=#{page}&section=1&prop=wikitext&format=xml"
    headers = {"User-Agent" => "Wikiquote scraper 1.0"}
    content = get(url, headers: headers)['api']['parse']['wikitext']['__content__']
    content.scan(/^\* (.*)$/).flatten
  end
end
Usage:
Wikiquote.get_quotes("The_Matrix")
Output:
["I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.",
"Whoa.",
"I know kung-fu.",
"Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.",
"Guns.. lots of guns...",
"There is no spoon. ",
"My name...is Neo!"]
I suggest //ul[preceding-sibling::h2[1][span/@id = 'Neo']]/li. Or, if the id attribute is not present or not relevant for the search, then based on the answer in a comment I think you want
(//h2[span[contains(@class, 'mw-headline')]])[1]/following-sibling::ul
[1 = count(preceding-sibling::h2[1] | (//h2[span[contains(@class, 'mw-headline')]])[1])]/li
See "XPath axis, get all following nodes until" for an explanation. I hope I have closed all brackets and braces correctly; I don't have time to test right now.

how do I formulate this xpath expression?

given the following div element
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
I want to retrieve contents of the span with class "b". However, some divs I want to parse lack the second two spans (of class "b" and "c"). For these divs, I want the contents of the span with class "a". Is it possible to create a single XPath expression that selects this?
If it is not possible, is it possible to create a selector that retrieves the entire contents of the div? ie retrieves
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
If I can do that, I can use a regex to find the data I want. (I can select the text within the div, but I'm not sure how to select the tags also. Just the text yields 123456789.)
More efficient -- requires no union:
//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]
An expression (like the one below) that is the union of two absolute "//" expressions typically performs two complete document tree traversals, after which the union operation deduplicates and sorts in document order. All this can be significantly less efficient than a single tree traversal, unless the XPath processor has an intelligent optimizer.
An example of such an inefficient expression:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select=
"//div/span
[@class='b'
or
@class='a'
and
not(parent::*[span[@class='b']])
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<div class="info">
title
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>
</div>
The XPath expression is evaluated and the selected elements (in this case just one) are copied to the output:
<span class="b">456</span>
When the same transformation is applied on a different XML document, where there is no class='b':
<div class="info">
title
<span class="a">123</span>
<span class="x">456</span>
<span class="c">789</span>
</div>
the same XPath expression is evaluated and the correctly selected element is copied to the output:
<span class="a">123</span>
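The fallback rule ("take the b span, else the a span") can also be expressed outside XPath. Here is a minimal sketch in Python's standard-library ElementTree. The function name preferred_spans is hypothetical, and the attribute test is an exact match on the whole class value, like @class='b' in the expression above.

```python
import xml.etree.ElementTree as ET


def preferred_spans(root):
    """For each div, return its class="b" spans, or its class="a" spans
    when the div has no class="b" span."""
    result = []
    for div in root.iter("div"):
        result.extend(
            div.findall("span[@class='b']") or div.findall("span[@class='a']")
        )
    return result


with_b = ET.fromstring(
    '<div class="info">title<span class="a">123</span>'
    '<span class="b">456</span><span class="c">789</span></div>'
)
without_b = ET.fromstring(
    '<div class="info">title<span class="a">123</span>'
    '<span class="x">456</span><span class="c">789</span></div>'
)
print([s.text for s in preferred_spans(with_b)])     # ['456']
print([s.text for s in preferred_spans(without_b)])  # ['123']
```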
The xpath expression should be something like:
//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']
The expression to the left of the union operator | selects all the b-class spans inside all divs; the expression on the right-hand side first finds all divs that do not have a b-class span and then selects their a-class spans. The | operator combines the two result sets.
See here for selecting nodes with not() and here for combining results with the | operator.
Also, to refer to the second part of your question have a look here.
Using node() in your xpath you can select everything (nodes + text) that is below the node selected. So you can get everything in the div returned by
//div/node()
for future processing by other means.
An expression that works on your input without the union operator:
//div/span[@class='a' or @class='b'][count(../span[@class='b']) + 1]
This is just for fun. I'd probably use something more like @inVader's answer in production code.

XPath expression for selecting all text in a given node, and the text of its children

Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a nodeset to normalize-space(), the nodeset will first be converted to its string-value. If no arguments are passed to normalize-space it will use the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. E.g., the following JavaScript example can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace only text node between the span and b elements might be a problem.
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
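Outside an XPath engine, the same two steps can be approximated with Python's standard library. This is a sketch, not an exact equivalent: itertext() yields the descendant text in document order, like the string value, while the split/join idiom collapses any whitespace Python recognises, which is slightly broader than what normalize-space() treats as whitespace.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    "<div id='theNode'>\n"
    'This is an <span style="color:red">example</span> <b>bolded</b> text\n'
    "</div>"
)

string_value = "".join(div.itertext())       # analogue of string(...)
normalized = " ".join(string_value.split())  # analogue of normalize-space(...)
print(normalized)  # This is an example bolded text
```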
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html")  # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join(all_txt).strip()
print(final_txt)  # 'This is an example bolded text'
How about this :
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmmss I am not sure about the last part though. You might have to play with that.
Normally, use
//div[@id='theNode']
to get all the text, but if the text is split across nodes, use
//div[@id='theNode']/text()
I'm not sure, but if you provide the link I will try it.

Select text from a node and omit child nodes

I need to select the text in a node, but not any child nodes.
the xml looks like this
<a>
apples
<b><c/></b>
pears
</a>
If I select a/text(), all I get is "apples". How would I retrieve "apples pears" while omitting <b><c/></b>?
Well, the path a/text() selects all text child nodes of the a element, so the path is correct in my view. Only if you use that path with e.g. XSLT 1.0 and <xsl:value-of select="a/text()"/> will it output the string value of just the first selected node. In XPath 2.0 and XQuery 1.0, string-join(a/text()/normalize-space(), ' ') yields the string apples pears, so maybe that helps with your problem. If not, please explain in which context you use XPath or XQuery such that a/text() only returns the (string?) value of the first selected node.
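If the host language offers no XPath 2.0, the same "direct text children only" join can be done by hand. Below is a minimal sketch with Python's standard-library ElementTree, where the helper name own_text is made up for illustration. It relies on ElementTree's model, in which the text children of an element are its own .text plus the .tail of each child element.

```python
import xml.etree.ElementTree as ET


def own_text(el):
    """Join the text nodes that are direct children of `el` (the a/text()
    nodes), skipping everything inside child elements."""
    parts = [el.text or ""] + [child.tail or "" for child in el]
    return " ".join(" ".join(parts).split())


a = ET.fromstring("<a>\napples\n<b>skipped<c/></b>\npears\n</a>")
print(own_text(a))  # apples pears
```

The word inside b is added here only to show that text belonging to child elements is left out.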
To retrieve all the descendants I advise using the // notation. This will return all text descendants below an element. Below is an xquery snippet that gets all the descendant text nodes and formats it like Martin indicated.
xquery version "1.0";
let $a :=
<a>
apples
<b><c/></b>
pears
</a>
return normalize-space(string-join($a//text(), " "))
Or if you have your own formatting requirements you could start by looping through each text element in the following xquery.
xquery version "1.0";
let $a :=
<a>
apples
<b><c/></b>
pears
</a>
for $txt in $a//text()
return $txt
If I select a/text(), all i get is
"apples". How would i retreive "apples
pears"
Just use:
normalize-space(/)
Explanation:
The string value of the root node (/) of the document is the concatenation of all its text-node descendants. Because some of these are white-space-only text nodes, normalize-space() is applied to strip the unwanted white space.
Here is a small demonstration how this solution works and what it produces:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
'<xsl:value-of select="normalize-space()"/>'
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<a>
apples
<b><c/></b>
pears
</a>
the wanted, correct result is produced:
'apples pears'

Resources