XPath: Get text that contains Obama but not Romney

I am quite new to XPath, so bear with me. I have an XPath expression
'.//*[contains(.,"Obama")]/text()'
that gets me the text that contains "Obama". However, I haven't been able to figure out how to add
and [not(contains(., "Romney"))] to the expression without getting a syntax error. How is it done? Help much appreciated!

Use:
.//*[contains(.,"Obama") and not(contains(.,"Romney"))]/text()
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:copy-of select=
'.//*[contains(.,"Obama") and not(contains(.,"Romney"))]/text()'/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document:
<election>
<choice>Maybe Obama</choice>
<choice>Maybe Romney</choice>
</election>
the XPath expression is evaluated and the selected node is copied to the output:
Maybe Obama
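For readers outside XSLT, here is a minimal Python sketch of the same filter. It assumes only the standard library; because ElementTree's XPath subset has no contains(), the predicate is applied in plain Python. The helper names (string_value, child_text_nodes, obama_not_romney) are illustrative, not part of any library.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<election>"
    "<choice>Maybe Obama</choice>"
    "<choice>Maybe Romney</choice>"
    "</election>"
)

def string_value(element):
    """XPath string value: all descendant text joined in document order."""
    return "".join(element.itertext())

def child_text_nodes(element):
    """Rough equivalent of element/text(): the non-empty direct text children."""
    pieces = [element.text] + [child.tail for child in element]
    return [piece for piece in pieces if piece]

def obama_not_romney(xml_text):
    root = ET.fromstring(xml_text)
    selected = []
    # root.iter() yields the root and every descendant element in document order,
    # which mirrors .//* evaluated with the document node as context.
    for element in root.iter():
        value = string_value(element)
        if "Obama" in value and "Romney" not in value:
            selected.extend(child_text_nodes(element))
    return selected

print(obama_not_romney(SAMPLE))  # ['Maybe Obama']
```

The election element is rejected because its string value contains both names, which is why only the first choice survives.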
Do note:
SomeExpression[x][y]
is not always equivalent to:
SomeExpression[x and y]
Therefore, the latter form is recommended rather than the former, which is the one given in the answer by @ChrisGerken.
Here is a concrete example:
Let's have this XML document:
<nums>
<num>01</num>
<num>02</num>
<num>03</num>
<num>04</num>
<num>05</num>
<num>06</num>
<num>07</num>
<num>08</num>
<num>09</num>
<num>10</num>
</nums>
and these two XPath expressions:
/*/*[. mod 3 = 0 and position() = 3]
and
/*/*[. mod 3 = 0][position() = 3]
The first expression selects:
<num>03</num>
However, the second expression selects:
<num>09</num>
And here is a complete XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:copy-of select=
"/*/*[. mod 3 = 0 and position() = 3]"/>
================
<xsl:copy-of select=
"/*/*[. mod 3 = 0][position() = 3]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the above XML document, the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
<num>03</num>
================
<num>09</num>
Explanation:
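position() is a context-sensitive function and typically produces different results when used in the k-th and in the m-th predicate, where k != m. Each predicate filters the node-set left by the previous one, so in the second predicate position() counts only the nodes that passed the first.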
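The difference can be imitated with ordinary list filtering. The following is a sketch in plain Python that emulates the two predicate forms on the ten num values; it is not an XPath engine, and the function names are made up for illustration.

```python
# The string values of the ten <num> elements from the example document.
NUMS = [f"{n:02d}" for n in range(1, 11)]

def combined(items):
    """Emulates /*/*[. mod 3 = 0 and position() = 3]:
    position() is the index within the original list of children."""
    return [v for pos, v in enumerate(items, start=1)
            if int(v) % 3 == 0 and pos == 3]

def chained(items):
    """Emulates /*/*[. mod 3 = 0][position() = 3]:
    the second predicate renumbers the survivors of the first."""
    survivors = [v for v in items if int(v) % 3 == 0]
    return [v for pos, v in enumerate(survivors, start=1) if pos == 3]

print(combined(NUMS))  # ['03']
print(chained(NUMS))   # ['09']
```

In the chained form the survivors are 03, 06 and 09, so the third one is 09.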

try this:
'.//*[contains(.,"Obama")][not(contains(.,"Romney"))]/text()'
You can put as many predicates as you like one after another:
[a][b][c]

Related

In XPath I want an expression to select all under div but those nodes that are not a. The tree structure must remain the same

How can I exclude specific descendants of a node? The expression *[not(self::nodetag)] seems to discriminate only at the child level of the node, accepting all other descendants in the returned node set. I want an expression to select all under div but those nodes that are not a, see the example below. The tree structure must remain the same.
The approach posted by @Dimitri Novatchev seems to be right, but not with the HAP implementation:
Using this example document:
<div>
<span>
<a>lala</a>
</span>
</div>
The HAP would return the following structure with his suggested expression /div/descendant::node()[not(self::a)]
<div>
<span>
<a>lala</a>
</span>
</div>
<span>
<a>lala</a>
</span>
If there were another tag other than a nested in span, it would also be returned as a separate tree. Does anyone know about this strange behavior? Is it a HAP bug?
Thanks
I want a expression to select all under div but those nodes that are
not a. The tree structure must remain the same.
Use:
/div/descendant::node()[not(self::a)]
This selects any descendant of the top element div that (the descendant) is not an a.
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="/div/descendant::node()[not(self::a)]">
<xsl:value-of select="concat('&#xA;', position(), '. &quot;')"/>
<xsl:copy-of select="."/>"
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<div>
<span>
<a>lala</a>
</span>
</div>
the XPath expression is evaluated and all selected nodes are output, each numbered and wrapped in quotes to make them clearly visible:
1. "
"
2. "<span>
<a>lala</a>
</span>"
3. "
"
4. "lala"
5. "
"
6. "
"
As we can see, 6 nodes are selected -- one span element, four whitespace-only text nodes and one non-whitespace-only text node -- and none of them is an a.
Update:
In a comment the OP has clarified that he actually wants the XML document to be transformed into another, in which any a descendant of a div is omitted.
Here is one such transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div//a"/>
</xsl:stylesheet>
When this transformation is applied on the same XML document (above), what I guess is the wanted result is produced:
<div>
<span/>
</div>
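Outside XSLT, the same "copy everything except a" idea can be sketched with Python's standard library. This is only an illustration under stated assumptions: the function name without_descendants is hypothetical, and the sketch discards any text that directly follows a removed element, which the XSLT above would keep.

```python
import xml.etree.ElementTree as ET

def without_descendants(xml_text, tag):
    """Parse xml_text and return its root with every descendant named `tag` removed."""
    root = ET.fromstring(xml_text)
    # Take a snapshot of the elements first, so removing children
    # does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                # Note: this also drops child.tail (text right after the element).
                parent.remove(child)
    return root

result = without_descendants("<div><span><a>lala</a></span></div>", "a")
print([element.tag for element in result.iter()])  # ['div', 'span']
```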
If we want to produce only the descendants of any div that has an a descendant, then we need almost the same transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="div[.//a]"><xsl:apply-templates/></xsl:template>
<xsl:template match="div//a"/>
</xsl:stylesheet>
The result of this applied to the same XML document as above is:
<span/>
@Devela: you are confusing the set of nodes selected by the XPath expression with the way that they are then displayed by the application that issued the request. It's quite common for an application to display a node by showing the whole subtree rooted at that node. So if your query is //div, and one of the selected div elements contains an <a> node as a descendant, the results will be shown including that <a> element. You can't change that by changing the XPath expression, because the XPath expression didn't select the <a> element; you can only change it by changing the way the results are displayed.
Now, if you want to display a <div> element that is like the <div> element in your source except that the <a> is omitted, then you are outside the scope of what XPath can do. XPath can only choose a subset of the nodes in your input tree, it can't create a modified tree. For that, you need XSLT or XQuery.
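The distinction between selecting a node and displaying it can be seen in a small Python sketch. It uses only the standard library and looks at element nodes only (text nodes are ignored for brevity); the function name select_non_a is invented for this example.

```python
import xml.etree.ElementTree as ET

def select_non_a(xml_text):
    """Select descendant elements that are not <a>, then render each one.

    Returns (tag, serialized subtree) pairs. Serializing a selected element
    shows its whole subtree, including descendants that were never selected.
    """
    root = ET.fromstring(xml_text)
    selected = [el for el in root.iter() if el is not root and el.tag != "a"]
    return [(el.tag, ET.tostring(el, encoding="unicode")) for el in selected]

for tag, rendered in select_non_a("<div><span><a>lala</a></span></div>"):
    print(tag, "->", rendered)
# span -> <span><a>lala</a></span>
```

Only span is selected, yet its rendering still contains the a element, which is the behavior described above.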

Position predicates with xmerl_xpath in Erlang

I'm trying to use xmerl_xpath to query a parsed XML document in Erlang, but I can't get the position predicates to work. Instead of returning the nth child element, I get the nth element of everything selected so far.
Sample code, where I'd like to extract the values 11 and 21 (the "first column"):
{XML,_} = xmerl_scan:string(
"<table>" ++
"<row><el>11</el><el>12</el></row>" ++
"<row><el>21</el><el>22</el></row>" ++
"</table>" ).
4 = length(xmerl_xpath:string( "//table/row/el", XML )). % OK
1 = length(xmerl_xpath:string( "(//table/row/el)[1]", XML )). % OK
1 = length(xmerl_xpath:string( "//table/row/el[1]", XML )). % Why not 2?
Is the result of the last query expected? What's the proper way, in the general case, to extract the nth child using xmerl_xpath?
(What I'm really trying to do is to parse HTML using mochiweb_html and query it using mochiweb_xpath, but the latter is essentially a wrapper around xmerl_xpath.)
I don't know Erlang, but the third XPath expression should indeed extract the first column, that is, the two nodes <el>11</el> and <el>21</el>. I created a simple test XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/">
<root>
<xsl:for-each select="//table/row/el[1]">
<xsl:copy-of select="."/>
</xsl:for-each>
</root>
</xsl:template>
</xsl:stylesheet>
that applied to
<table>
<row>
<el>11</el>
<el>12</el>
</row>
<row>
<el>21</el>
<el>22</el>
</row>
</table>
produces:
<root>
<el>11</el>
<el>21</el>
</root>
If this is not what you are getting I suspect that the library you are using is buggy.
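As a second cross-check in another implementation, Python's standard-library ElementTree also applies a positional predicate per parent. This sketch uses paths relative to the table root, since ElementTree supports only a subset of XPath; the two function names are illustrative.

```python
import xml.etree.ElementTree as ET

TABLE = (
    "<table>"
    "<row><el>11</el><el>12</el></row>"
    "<row><el>21</el><el>22</el></row>"
    "</table>"
)

def first_column(xml_text):
    """row/el[1]: the first el child of each row."""
    table = ET.fromstring(xml_text)
    return [el.text for el in table.findall("row/el[1]")]

def first_overall(xml_text):
    """Counterpart of (//table/row/el)[1]: first el in the whole result list."""
    table = ET.fromstring(xml_text)
    return [el.text for el in table.findall("row/el")[:1]]

print(first_column(TABLE))   # ['11', '21']
print(first_overall(TABLE))  # ['11']
```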

XPath: how to extract first line of multi-line attribute value?

For example:
<doc>
<elem attr="firstLine &x0a; secondLine"/>
<elem attr="1stLine"/>
</doc>
The need is to get the first line, when there are many
From the example above, we want to get { 'firstLine', '1stLine'}
Thanks in advance
Use:
substring-before(/*/elem[1]/@attr, '&#xA;')
Here is an XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="substring-before(/*/elem[1]/@attr, '&#xA;')"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document (corrected to be well-formed):
<doc>
<elem attr="firstLine&#xA;secondLine"/>
<elem attr="1stLine"/>
</doc>
it evaluates the XPath expression and outputs the result of this evaluation:
firstLine
If you're using XPath 2.0, you can use tokenize()
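Note that substring-before() returns an empty string when the value contains no newline, so on its own it does not yield '1stLine' for the second element. If the host language can post-process the attribute values, a sketch like the following (Python standard library, hypothetical function name first_lines) handles both cases by splitting on the first newline:

```python
import xml.etree.ElementTree as ET

# The newline is written as a character reference so that the parser
# does not normalize it to a space inside the attribute value.
SAMPLE = (
    '<doc>'
    '<elem attr="firstLine&#10;secondLine"/>'
    '<elem attr="1stLine"/>'
    '</doc>'
)

def first_lines(xml_text):
    """Return the first line of the attr attribute of every elem child."""
    doc = ET.fromstring(xml_text)
    return [elem.get("attr", "").split("\n", 1)[0] for elem in doc.findall("elem")]

print(first_lines(SAMPLE))  # ['firstLine', '1stLine']
```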

Return full text element (including child/descendant elements)

I'm trying to get the text from the first occurrence on the page of div/p, and only the first p. The <p> contains other tags (<b>, <a href>) and the returned text from <p> stops at any other tag. Is there a way to get this line to return all the text between <p> and </p>, even between embedded tags?
puts doc.xpath('html/body/div/p[1]/text()').first
Use:
string((//div/p)[1])
When this XPath expression is evaluated the result is the string value of the first p in the document that is a child of a div.
By definition, the string value of an element is the concatenation (in document order) of all of its text-node descendants.
Therefore, you get exactly all the text in the subtree rooted by this p element, with any other nodes (elements, comments, PIs) skipped.
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="string(p)"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the following XML document (no such provided!):
<p>
Hello <b>
XML
World!</b>
</p>
the result of the evaluated XPath expression is output:
Hello XML
World!
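The string-value definition used above (concatenate all descendant text nodes in document order) has a direct counterpart in Python's standard library, shown here as a small sketch. The sample markup and the function name first_p_text are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<div><p>Hello <b>XML</b> <a href='#'>World</a>!</p><p>second</p></div>"

def first_p_text(xml_text):
    """String value of the first p child of the root div."""
    div = ET.fromstring(xml_text)
    first_p = div.find("p")  # find() returns the first match only
    # itertext() walks every text node in the subtree in document order.
    return "".join(first_p.itertext())

print(first_p_text(SAMPLE))  # Hello XML World!
```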
As an alternative to a longer XPath expression, with Nokogiri you can use Nokogiri::XML::Node#inner_text:
puts doc.xpath('html/body/div/p[1]').inner_text

XPath to return default value if node not present

Say I have a pair of XML documents
<Foo>
<Bar/>
<Baz>mystring</Baz>
</Foo>
and
<Foo>
<Bar/>
</Foo>
I want an XPath (Version 1.0 only) that returns "mystring" for the first document and "not-found" for the second. I tried
(string('not-found') | //Baz)[last()]
but the left hand side of the union isn't a node-set
In XPath 1.0, use:
concat(/Foo/Baz,
substring('not-found', 1 div not(/Foo/Baz)))
If you want to handle a possibly empty Baz element, use:
concat(/Foo/Baz,
substring('not-found', 1 div not(/Foo/Baz[node()])))
With this input:
<Foo>
<Baz/>
</Foo>
Result: the string not-found.
Special case:
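If you want to get 0 when a numeric node is missing, use the sum(/Foo/Baz) function. To also get 0 for an empty Baz, filter it out first with sum(/Foo/Baz[node()]), because an empty string converts to NaN.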
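Why the concat()/substring() expression works: not(/Foo/Baz) is true only when Baz is missing, so 1 div not(...) is 1 in that case and Infinity otherwise, and substring('not-found', Infinity) is empty. The following Python sketch emulates that arithmetic with the standard library; xpath_substring and baz_or_default are illustrative names, and the start value is set explicitly because Python raises an error on division by zero instead of returning infinity.

```python
import math
import xml.etree.ElementTree as ET

def xpath_substring(text, start):
    """Emulates two-argument XPath 1.0 substring(): keep the characters
    whose 1-based position is >= round(start)."""
    if math.isnan(start):
        return ""
    if math.isinf(start):
        return "" if start > 0 else text
    first = max(math.floor(start + 0.5), 1)  # XPath round(), clamped to 1
    return text[first - 1:]

def baz_or_default(xml_text, default="not-found"):
    foo = ET.fromstring(xml_text)
    baz = foo.find("Baz")
    missing = baz is None                  # not(/Foo/Baz)
    start = 1.0 if missing else math.inf   # 1 div true() = 1, 1 div false() = Infinity
    value = "" if missing else "".join(baz.itertext())
    return value + xpath_substring(default, start)  # concat(...)

print(baz_or_default("<Foo><Bar/><Baz>mystring</Baz></Foo>"))  # mystring
print(baz_or_default("<Foo><Bar/></Foo>"))                     # not-found
```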
@Alejandro provided the best XPath 1.0 answer, which has been known for years, since first used by Jeni Tennison almost ten years ago.
The only problem with this expression is its shiny elegance, which makes it difficult to understand, and not only for novice programmers.
In a hosted XPath 1.0 (and every XPath is hosted!) one can use more understandable expressions:
string((/Foo/Baz | $vDefaults[not(/Foo/Baz/text())]/Foo/Baz)[last()])
Here the variable $vDefaults is a separate document that has the same structure as the primary XML document, and whose text nodes contain default values.
Or, if XSLT is the hosting language, one can use the document() function:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:my="my:my">
<xsl:output method="text"/>
<my:defaults>
<Foo>
<Bar/>
<Baz>not-found</Baz>
</Foo>
</my:defaults>
<xsl:template match="/">
<xsl:value-of select=
"concat(/Foo/Baz,
document('')[not(current()/Foo/Baz/text())]
/*/my:defaults/Foo/Baz
)"/>
</xsl:template>
</xsl:stylesheet>
Or, not using concat():
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:my="my:my">
<xsl:output method="text"/>
<my:defaults>
<Foo>
<Bar/>
<Baz>not-found</Baz>
</Foo>
</my:defaults>
<xsl:variable name="vDefaults" select="document('')/*/my:defaults"/>
<xsl:template match="/">
<xsl:value-of select=
"(/Foo/Baz
| $vDefaults/Foo/Baz[not(current()/Foo/Baz/text())]
)
[last()]"/>
</xsl:template>
</xsl:stylesheet>
In XPath 2.0, use:
/Foo/(Baz/string(), 'not-found')[1]
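The sequence expression reads as "take the first item of the Baz values followed by the default". A rough Python equivalent, using only the standard library and a made-up function name, looks like this:

```python
import xml.etree.ElementTree as ET

def first_or_default(xml_text, default="not-found"):
    """Emulates (Baz/string(), 'not-found')[1] evaluated on the Foo root."""
    foo = ET.fromstring(xml_text)
    # Build the sequence: every Baz string value, then the default at the end.
    candidates = ["".join(baz.itertext()) for baz in foo.findall("Baz")]
    candidates.append(default)
    return candidates[0]

print(first_or_default("<Foo><Bar/><Baz>mystring</Baz></Foo>"))  # mystring
print(first_or_default("<Foo><Bar/></Foo>"))                     # not-found
```

As with the XPath 2.0 expression, an empty Baz element yields an empty string rather than the default.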
If you are okay with printing an empty string instead of the 'not-found' message, then use:
/Foo/concat(Baz/text(), '')
Later, you can replace the empty strings with 'not-found'.
