Select text from a node and omit child nodes - xpath

I need to select the text in a node, but not the text of any child nodes.
The XML looks like this:
<a>
apples
<b><c/></b>
pears
</a>
If I select a/text(), all I get is "apples". How would I retrieve "apples pears" while omitting <b><c/></b>?

The path a/text() selects all text child nodes of the a element, so the path is correct in my view. However, if you use that path in XSLT 1.0 with <xsl:value-of select="a/text()"/>, only the string value of the first selected node is output. In XPath 2.0 and XQuery 1.0, string-join(a/text()/normalize-space(), ' ') yields the string apples pears, so maybe that helps with your problem. If not, please explain the context in which you use XPath or XQuery such that a/text() returns only the (string?) value of the first selected node.
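For readers outside XSLT/XQuery, here is a minimal Python sketch of the same idea using only the standard library. It is not an XPath evaluation: ElementTree stores an element's text children in .text and in the .tail of each child element, so the hypothetical helper direct_text_nodes() collects those. Unlike the string-join expression above, it also skips whitespace-only text nodes.

```python
import xml.etree.ElementTree as ET

def direct_text_nodes(elem):
    """Return the text children of elem, ignoring text nested in child elements."""
    parts = [elem.text or ""]
    # Text that follows a child element is stored in that child's .tail
    parts.extend(child.tail or "" for child in elem)
    return parts

a = ET.fromstring("<a>\napples\n<b><c/></b>\npears\n</a>")

# Roughly string-join(a/text()/normalize-space(), ' '),
# except that whitespace-only text nodes are dropped.
result = " ".join(" ".join(t.split()) for t in direct_text_nodes(a) if t.strip())
print(result)  # apples pears
```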

To retrieve text from all descendants, I advise using the // notation: $a//text() returns every text node below an element, not just its direct children. Below is an XQuery snippet that gets all the descendant text nodes and formats them as Martin indicated.
xquery version "1.0";
let $a :=
<a>
apples
<b><c/></b>
pears
</a>
return normalize-space(string-join($a//text(), " "))
Or, if you have your own formatting requirements, you could start by looping through each text node, as in the following XQuery.
xquery version "1.0";
let $a :=
<a>
apples
<b><c/></b>
pears
</a>
for $txt in $a//text()
return $txt

If I select a/text(), all I get is "apples". How would I retrieve "apples pears"
Just use:
normalize-space(/)
Explanation:
The string value of the root node (/) of the document is the concatenation of all its text-node descendants. Because these text nodes contain leading, trailing and white-space-only content, we use normalize-space() to strip and collapse the unwanted white space.
Here is a small demonstration how this solution works and what it produces:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
'<xsl:value-of select="normalize-space()"/>'
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<a>
apples
<b><c/></b>
pears
</a>
the wanted, correct result is produced:
'apples pears'
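A rough Python equivalent, as a sketch only: Element.itertext() yields every descendant text node in document order, which mirrors the XPath string value, and split/join approximates normalize-space(). The helper name normalized_string_value() and the second sample with "kiwis" are made up for illustration. The second call shows that this approach also picks up text nested inside child elements, so it matches the question's requirement only when the children (here b and c) carry no text of their own.

```python
import xml.etree.ElementTree as ET

def normalized_string_value(elem):
    """Concatenate all descendant text, then collapse runs of whitespace."""
    return " ".join("".join(elem.itertext()).split())

doc = ET.fromstring("<a>\napples\n<b><c/></b>\npears\n</a>")
print(normalized_string_value(doc))     # apples pears

# Hypothetical variant: text inside <b> becomes part of the result too.
nested = ET.fromstring("<a>\napples\n<b>kiwis</b>\npears\n</a>")
print(normalized_string_value(nested))  # apples kiwis pears
```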

Related

Split methods on XPath 1.0

I use XPath. How can I simulate a split method?
I have read the documentation and I know that XPath 1.0 does not have this method.
I have a document that contains these tags:
<TestCategoryModule>
<ItemCategories>
<![CDATA[Birthday Travel,Travel]]>
</ItemCategories>
</TestCategoryModule>
<TestCategoryModule2>
<ItemCategories>
<![CDATA[Travel]]>
</ItemCategories>
</TestCategoryModule2>
I want to filter items by 'ItemCategories', but when I filter by the word 'Travel', 2 items are returned. I use this filter: "ItemCategories[contains(text(), 'Travel')]".
I want filtering by "Travel" to return only the second item. How can I do it?
Use:
/*/*/*[contains(concat(',', ., ','), ',Travel,')]
Here is XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/*/*[contains(concat(',', ., ','), ',Travel,')]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to this XML document (essentially the provided XML fragment, extended with one more test case and made a well-formed XML document):
<t>
<TestCategoryModule>
<ItemCategories>Birthday Travel,Travel</ItemCategories>
</TestCategoryModule>
<TestCategoryModule2>
<ItemCategories>Birthday Travel</ItemCategories>
</TestCategoryModule2>
<TestCategoryModule2>
<ItemCategories>Travel</ItemCategories>
</TestCategoryModule2>
</t>
The wanted, correct result is produced:
<ItemCategories>Birthday Travel,Travel</ItemCategories>
<ItemCategories>Travel</ItemCategories>
I was a little wrong, or described the problem poorly. The problem is that the categories are stored as a string. I have three items: the first contains the categories (Birthday Travel,Travel), the second (Birthday Travel), the third (Travel). When I filter for the word "Travel", I need to get the first and third items, but I get all three items, because all of them contain the word "Travel".
You actually don't need split() for the problem that you've described. If you want to match Travel but not Travel,Travel you want = instead of contains(). To deal with the whitespace around your CDATA sections, wrap it in normalize-space().
All put together, try ItemCategories[normalize-space(text()) = 'Travel'].
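The two ideas above (wrapping the value in delimiters before calling contains(), and normalizing the whitespace around the CDATA content) can be combined. Here is a small Python sketch of that logic on plain strings; the function name has_category() and the item labels are hypothetical, and no XPath engine is involved.

```python
def has_category(raw, wanted, sep=","):
    """Mimic contains(concat(',', normalize-space(.), ','), ',wanted,')."""
    normalized = " ".join(raw.split())  # trim and collapse whitespace
    # Surrounding both sides with the separator forces whole-token matches.
    return f"{sep}{wanted}{sep}" in f"{sep}{normalized}{sep}"

# The three items from the clarified question, with CDATA-style line breaks.
items = {
    "first": "\nBirthday Travel,Travel\n",
    "second": "Birthday Travel",
    "third": "\nTravel\n",
}

matches = [name for name, raw in items.items() if has_category(raw, "Travel")]
print(matches)  # ['first', 'third']
```

Because the comparison is against ",Travel," rather than "Travel", the token "Birthday Travel" never matches, which is what a real split would give you.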

xpath with node(), how to express `node()[.//x]` condition?

I have an XPath expression that must match text and tags, except the tag <aa>; so,
./node()[name()!='aa']
is the correct XPath.
But it is insufficient for cases where the tag aa is nested inside the node; I need something like,
./node()[name()!='aa' and not(.//aa)]
but this XPath does not work (!).
NOTE
I used
./*[not(self::aa or .//aa)] | ./text()
but it loses the original order of the nodes. The problem is more evident when working with XSLT, for example:
<xsl:for-each select="./*[not(self::aa or .//aa)] | ./text()">
<xsl:copy-of select="."/>
</xsl:for-each>
does not work as expected (the order of the nodes is not ensured). When using ./node() the order is always correct.
PS: with XSLT we have a solution using all the explained xpaths,
<xsl:for-each select="./node()[name()!='aa']">
<xsl:if test="not(.//aa)"><xsl:copy-of select="."/></xsl:if>
</xsl:for-each>
but the ideal/simplest one does not give the same result (when processing big and complex inputs),
<xsl:copy-of select="*[not(self::aa or .//aa)] | ./text()"/>
I'm imagining your file looks like:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<aa/>
<b>
<aa/>
</b>
<c>
<b>
<aa/>
</b>
</c>
<d/>
<e>
<b/>
</e>
</root>
Then the expression
//node()[not(descendant-or-self::aa)]
returns all nodes (including the whitespace text nodes) that are not themselves an <aa> element or have an <aa> descendant. Children of <aa> are matched as well.
You'll probably want to do something like
<xsl:copy-of select="node()[not(descendant-or-self::aa)]"/>
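To make the predicate concrete, here is a minimal Python sketch of the same filter over the imagined file above. It is an approximation under stated assumptions: the standard-library ElementTree does not expose text nodes as separate nodes, so only element children are filtered, and children_without_aa() is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <aa/>
  <b><aa/></b>
  <c><b><aa/></b></c>
  <d/>
  <e><b/></e>
</root>"""

def children_without_aa(parent):
    """Keep child elements that are not <aa> and contain no <aa> descendant."""
    # Element.iter("aa") walks the element itself plus all its descendants,
    # which corresponds to the descendant-or-self axis.
    return [child for child in parent
            if next(child.iter("aa"), None) is None]

root = ET.fromstring(XML)
kept = [child.tag for child in children_without_aa(root)]
print(kept)  # ['d', 'e']
```

The children are visited in document order, so the original sequence of the surviving nodes is preserved.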

Is there any method to get any type of sibling of a particular node in Xpath 2.0

Is there any method to get any type of sibling of a particular node in XPath 2.0?
The "following-sibling" axis seems to support only siblings of the same type.
Ex:
<node>
<b name="bold">abc</b>
<div>gef</div>
</node>
I want to select all the siblings of the <b name="bold">.
Is there any method to get any type of sibling of a particular node in XPath 2.0?
The following-sibling axis seems to support only siblings of the same type.
Use:
following-sibling::node()
This selects all following sibling nodes of any type -- elements, text nodes, processing-instruction nodes and comment nodes.
Here is a complete XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="/*/b[@name='bold']/following-sibling::node()">
"<xsl:copy-of select="."/>"
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<node>
<b name="bold">abc</b>
<div>gef</div>
</node>
the XPath expression is evaluated (starting from the wanted element) and all three selected nodes are copied to the output:
"
"
"<div>gef</div>"
"
"
As we can see, all sibling nodes are selected -- a whitespace-only text node, a div element and another whitespace-only text node.
Do note: This is an XPath 1.0 expression, and I don't believe XPath 2.0 adds any feature for selecting siblings beyond what is already in XPath 1.0.
If by "sibling" you mean something different from the meaning of "sibling" in XPath, then you must define precisely what you mean.
Not sure I understand the question, but how about:
//*[preceding-sibling::b]
That will get all elements that have a b element as a preceding sibling, that is, the element siblings that follow <b name="bold">abc</b>. The * selects any type of element.
If you want all siblings:
//*[preceding-sibling::b or following-sibling::b]
And if you want to be more specific in how you select the b element:
//*[preceding-sibling::b[@name="bold"]]

Select adjacent sibling elements without intervening non-whitespace text nodes

Given markup like:
<p>
<code>foo</code><code>bar</code>
<code>jim</code> and then <code>jam</code>
</p>
I need to select the first three <code>, but not the last. The logic is "Select all code elements that have a preceding-or-following-sibling-element that is also a code, unless there exist one or more text nodes with non-whitespace content between them."
Given that I am using Nokogiri (which uses libxml2) I can only use XPath 1.0 expressions.
Although a tricky XPath expression is desired, Ruby code/iterations to perform the same on a Nokogiri document are also acceptable.
Note that the CSS adjacent sibling selector ignores non-element nodes, and so selecting nokodoc.css('code + code') will incorrectly select the last <code> block.
Nokogiri.XML('<r><a/><b/> and <c/></r>').css('* + *').map(&:name)
#=> ["b", "c"]
Edit: More test cases, for clarity:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
All the Y above should be selected. None of the N should be selected. The contents of the <code> elements are used only to indicate which should be selected: you may not use the content to determine whether or not to select an element.
The context elements in which the <code> appear are irrelevant. They may appear in <li>, they may appear in <p>, they may appear in something else.
I want to select all the consecutive runs of <code> at once. It is not a mistake that there is a space character in the middle of one of the sets of Y.
Use:
//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"//code
[preceding-sibling::node()[1][self::code]
or
preceding-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
preceding-sibling::node()[2][self::code]
or
following-sibling::node()[1][self::code]
or
following-sibling::node()[1]
[self::text()[not(normalize-space())]]
and
following-sibling::node()[2][self::code]
]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<section><ul>
<li>Go to <code>N</code> and
then <code>Y</code><code>Y</code><code>Y</code>.
</li>
<li>If you see <code>N</code> or <code>N</code> then…</li>
</ul>
<p>Elsewhere there might be: <code>N</code></p>
<p><code>N</code> across parents.</p>
<p>Then: <code>Y</code> <code>Y</code><code>Y</code> and <code>N</code>.</p>
<p><code>N</code><br/><code>N</code> elements interrupt, too.</p>
</section>
the contained XPath expression is evaluated and the selected nodes are copied to the output:
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
<code>Y</code>
//code[
(
following-sibling::node()[1][self::code]
or (
following-sibling::node()[1][self::text() and normalize-space() = ""]
and
following-sibling::node()[2][self::code]
)
)
or (
preceding-sibling::node()[1][self::code]
or (
preceding-sibling::node()[1][self::text() and normalize-space() = ""]
and
preceding-sibling::node()[2][self::code]
)
)
]
I think this does what you want, though I won’t claim you’d actually want to use it.
I’m assuming text nodes are always merged together so that there won’t be two adjacent to each other, which I believe is generally the case, but might not be if you’re doing DOM manipulations beforehand. I’ve also assumed that there won’t be any other elements between code elements, or that if there are they prevent selection like non-whitespace text.
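Since the question also accepts plain iteration, here is a hedged Python sketch of the same rule using only the standard library (the question itself uses Ruby/Nokogiri, so treat this as a language-neutral illustration). It relies on ElementTree keeping the text between two sibling elements in the first one's .tail. The function name adjacent_codes() is hypothetical, and it assumes, like the expression above, that any intervening element breaks a run.

```python
import xml.etree.ElementTree as ET

def adjacent_codes(root):
    """Return <code> elements touching another <code> sibling,
    allowing only whitespace text in between."""
    selected = []
    for parent in root.iter():
        kids = list(parent)
        for i, kid in enumerate(kids):
            if kid.tag != "code":
                continue
            # Text between kids[i-1] and kid lives in kids[i-1].tail
            prev_ok = (i > 0 and kids[i - 1].tag == "code"
                       and not (kids[i - 1].tail or "").strip())
            # Text between kid and kids[i+1] lives in kid.tail
            next_ok = (i + 1 < len(kids) and kids[i + 1].tag == "code"
                       and not (kid.tail or "").strip())
            if prev_ok or next_ok:
                selected.append(kid)
    return selected

doc = ET.fromstring(
    "<p>\n<code>foo</code><code>bar</code>\n"
    "<code>jim</code> and then <code>jam</code>\n</p>"
)
picked = [c.text for c in adjacent_codes(doc)]
print(picked)  # ['foo', 'bar', 'jim']
```

Note that str.strip() treats a few more characters as whitespace than XPath's normalize-space() does, so the two can differ on exotic input such as non-breaking spaces.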
I think this is what you want:
/p/code[not(preceding-sibling::text()[not(normalize-space(.)="")])]

XPath expression for selecting all text in a given node, and the text of its children

Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set will first be converted to its string-value. If no arguments are passed to normalize-space, it will use the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. E.g., the following JavaScript example can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace only text node between the span and b elements might be a problem.
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html") # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall() # All text nodes, in document order
final_txt = ''.join(all_txt).strip() # Concatenate and trim the surrounding newlines
print(final_txt) # 'This is an example bolded text'
How about this:
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmm, I am not sure about the last part, though. You might have to play with that.
Normally, use
//div[@id='theNode']
to get all the text; but if the text comes back split, then use
//div[@id='theNode']/text()
Not sure, but if you provide me the link I will try.
