Get a string sequence from a node-set in XPath 1.0 - xpath

I'm using XPath 1.0 to parse an HTML file and I want to get a string sequence from a node-set. First I select a node-set (eg: //div) and then I want the string-value of each node of the set. I've tried with string(//div) but it only returns the string-value of the first node in the set.
Example:
<foo>
<div>
bbbb<p>aaa</p>
</div>
<div>
cccc<p>aaa</p>
</div>
</foo>
I expect a result like ('bbbbaaa', 'ccccaaa'), but I only get 'bbbbaaa'.

In XPath 1.0 the "string-value of a node-set" is by definition the string value of the first node in the node-set.
In XPath 2.0 the following expression produces a sequence of the string values of all div elements in an XML document:
//div/string(.)
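XPath 1.0 has no sequence type, so a common workaround is to select the node-set and compute each node's string-value in the host language. A minimal sketch in Python, assuming the standard library's xml.etree.ElementTree (it supports only a subset of XPath, so itertext() stands in for string(.) here; the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

xml_text = "<foo><div>bbbb<p>aaa</p></div><div>cccc<p>aaa</p></div></foo>"
root = ET.fromstring(xml_text)

# Step 1: select the node-set (the equivalent of //div).
divs = list(root.iter("div"))

# Step 2: compute the string-value of each node, i.e. the concatenation
# of all descendant text in document order.
values = ["".join(div.itertext()) for div in divs]

print(values)  # ['bbbbaaa', 'ccccaaa']
```

The same two-step pattern should carry over to any XPath 1.0 host API: evaluate the node-set expression once, then ask for the string-value of each node individually.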

Xpath expression (nokogiri) to get tag's child element?

From my xml, I can get this :
<home>
<creditors>
<count>2</count>
</creditors>
</home>
OR even this :
<home>
<creditors>
<moreThan>2</moreThan>
</creditors>
</home>
Which XPath expression can I use to get "<count>2</count>" instead of only "2", or "<moreThan>2</moreThan>" instead of "2"?
This XPath,
//creditors/count
will select all count child elements of all creditors elements in the XML document.
Update per OP's request in comments for a single XPath that selects both count and moreThan elements:
This XPath,
//creditors/*[self::count or self::moreThan]
will select all count or moreThan child elements of all creditors elements in the XML document.
Assuming that your xpath expression is OK, you just need to convert the element to string:
doc.xpath("home/creditors/*").to_s
=> "<count>2</count>"
Check this with queries that return more than one element, to make sure the result is the behaviour you want.
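For comparison, here is the same idea (select the element, then serialize it rather than reading its text) as a rough sketch in Python with the standard library's xml.etree.ElementTree. The helper name creditor_children is made up for this example, and the tag filter plays the role of the [self::count or self::moreThan] predicate, which ElementTree's XPath subset does not support:

```python
import xml.etree.ElementTree as ET

def creditor_children(xml_text):
    """Return the serialized count/moreThan children of <creditors>."""
    root = ET.fromstring(xml_text)  # root is the <home> element
    wanted = {"count", "moreThan"}
    return [
        ET.tostring(child, encoding="unicode")  # markup, not just the text
        for child in root.findall("./creditors/*")
        if child.tag in wanted
    ]

first = creditor_children("<home><creditors><count>2</count></creditors></home>")
second = creditor_children("<home><creditors><moreThan>2</moreThan></creditors></home>")
print(first)   # ['<count>2</count>']
print(second)  # ['<moreThan>2</moreThan>']
```

Returning a list keeps each element separate, which makes the multi-element case easy to inspect.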

Predicates: how is the expression nodeName='text' evaluated?

In this xpath:
/A/B[C='hello']
Is C="hello" some kind of syntactic shortcut for C[text()='hello']? Is it documented anywhere?
Edit: Okay, I discovered one difference: C= returns all the text nodes in C and C's children, while C[text()= returns only the text nodes in C.
Now, suppose I have the XML:
<root>
<A>
<B>
<C>hello<E>EEE</E>world</C>
<D>world</D>
</B>
<B>
<C>goodbye</C>
<D>mars</D>
</B>
</A>
</root>
How would I choose the B node containing the first C node using the syntax C[text()=? I can get the B node using the C= syntax like this:
/root/A/B[C="helloEEEworld"]
But this doesn't work:
/root/A/B[C[text()="helloworld"]]
nor do these:
/root/A/B[C[text()="hello world"]]
/root/A/B[C[text()="helloEEEworld"]]
Hmmm...this works:
/root/A/B[C[text()="hello"]]
Why is that? Does text() only return the first text node? According to the W3C, text() returns all text node children of the context node.
text() really does return all text node children, as a node-set.
When you use /root/A/B[C[text()="hello"]], you are asking for a B node that has a C child where any one of C's direct child text nodes is equal to "hello".
In the same way you can match it by :
/root/A/B[C[text()="world"]]
or explicitly specify that you want to get node by exact first or second direct child text node:
/root/A/B[C[text()[1]="hello"]]
/root/A/B[C[text()[2]="world"]]
If you want to match the required node by its complete text content, you can use
/root/A/B[C[.="helloEEEworld"]]
or
/root/A/B[C="helloEEEworld"]
C in the predicate expression [C='hello'] returns all C elements that are direct children of the context element, which is B. So the entire predicate is a boolean expression that compares a node-set with a string (note that an element is a type of node in the XPath data model), and the behavior in this case is documented in the spec as follows:
If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true. If one object to be compared is a node-set and the other is a boolean, then the comparison will be true if and only if the result of performing the comparison on the boolean and on the result of converting the node-set to a boolean using the boolean function is true. [source]
C='hello' in /A/B[C='hello'] evaluates to true if any of the C elements, once converted to a string, equals 'hello'. So it is more of a shortcut for C[string()='hello'], if you will.
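The "there is a node in the node-set such that..." rule can be mimicked in a few lines of Python. This is only a sketch using the standard library's xml.etree.ElementTree, with itertext() standing in for the string-value; the helper name any_c_equals is invented for the example:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root><A>"
    "<B><C>hello<E>EEE</E>world</C><D>world</D></B>"
    "<B><C>goodbye</C><D>mars</D></B>"
    "</A></root>"
)

def any_c_equals(b, target):
    """Emulate the predicate [C=target]: true if ANY C child of b has a
    string-value equal to target."""
    return any("".join(c.itertext()) == target for c in b.findall("C"))

# Equivalent in spirit to /root/A/B[C="helloEEEworld"]
selected = [b for b in doc.findall("./A/B") if any_c_equals(b, "helloEEEworld")]
print(len(selected), selected[0].find("D").text)  # 1 world

# "hello" alone is not the string-value of any C element, so nothing matches.
print([b for b in doc.findall("./A/B") if any_c_equals(b, "hello")])  # []
```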
"Hmmm...this works:
/root/A/B[C[text()="hello"]]
Why is that? Does text() only return the first text node? According to the W3C, text() returns all text node children of the context node."
text() in this context returns all direct child text nodes, not just the first one. This is because child:: is the default axis in XPath. Contrast your XPath with its equivalent verbose version:
/child::root/child::A/child::B[child::C[child::text()="hello"]]
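To see why text()="hello" and text()="world" both match while text()="helloworld" does not, it can help to list the two things side by side. A small sketch, assuming the standard library's xml.etree.ElementTree, where an element's direct text nodes are its own .text plus the .tail of each child (the helper names are illustrative):

```python
import xml.etree.ElementTree as ET

c = ET.fromstring("<C>hello<E>EEE</E>world</C>")

def child_text_nodes(elem):
    """Direct child text nodes of elem, like child::text()."""
    candidates = [elem.text] + [child.tail for child in elem]
    return [t for t in candidates if t]

def string_value(elem):
    """String-value of elem: all descendant text in document order."""
    return "".join(elem.itertext())

print(child_text_nodes(c))  # ['hello', 'world']
print(string_value(c))      # helloEEEworld

# C[text()="hello"] is true because SOME child text node equals "hello".
print(any(t == "hello" for t in child_text_nodes(c)))       # True
# C[text()="helloworld"] is false because no single text node equals it.
print(any(t == "helloworld" for t in child_text_nodes(c)))  # False
```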

XPath expression: selecting text nodes between element nodes

Based in the following HTML I want to extract TextA, TextC and TextE.
<div id='content'>
TextA
<br/>
<br/>
<p>TextB</p>
TextC
<br/>
TextC
<p>TextD</p>
TextE
</div>
I tried to get TextC like so but I don't get the result I want:
Query:
//*[preceding::p[contains(.,"TextB")] and following::p[contains(.,"TextD")]]
Expected result:
["TextC", <br/>, "TextC"]
Actual result:
[<br/>]
Is there a way to select the text nodes without using indexes like //div/text()[1]?
The two text nodes aren't in the result of your XPath because * only matches elements. To match both elements and text nodes, you can use node() instead:
//node()[preceding::p[contains(.,"TextB")] and following::p[contains(.,"TextD")]]
Or, if you want only the text nodes (i.e. excluding <br/>), you can use text() instead of node():
//text()[preceding::p[contains(.,"TextB")] and following::p[contains(.,"TextD")]]
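The difference between *, node() and text() can be illustrated outside XPath as well. The following is a rough sketch in Python with the standard library's xml.etree.ElementTree; it flattens the children of the div into one document-order list and slices between the two p elements, which is a simplification of the preceding::/following:: tests that happens to be equivalent for this flat, whitespace-free sample. The helper names are made up for the example:

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    "<div id='content'>TextA<br/><br/><p>TextB</p>"
    "TextC<br/>TextC<p>TextD</p>TextE</div>"
)

def child_nodes(parent):
    """Children in document order: str for text nodes, Element for elements."""
    nodes = [parent.text] if parent.text else []
    for child in parent:
        nodes.append(child)
        if child.tail:
            nodes.append(child.tail)
    return nodes

def is_p_containing(node, needle):
    return (isinstance(node, ET.Element) and node.tag == "p"
            and needle in "".join(node.itertext()))

nodes = child_nodes(div)
start = next(i for i, n in enumerate(nodes) if is_p_containing(n, "TextB"))
end = next(i for i, n in enumerate(nodes) if is_p_containing(n, "TextD"))
between = nodes[start + 1:end]          # like node()[...]

elements_only = [n.tag for n in between if isinstance(n, ET.Element)]  # like *[...]
text_only = [n for n in between if isinstance(n, str)]                 # like text()[...]

print(elements_only)  # ['br']
print(text_only)      # ['TextC', 'TextC']
```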

XPath: get the text inside a tag that might or might not exist

I have some HTML which contains tags like the ones below:
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT"><span style='color: #EFFFFF'>text3</span></div>
<div id="SNT"><span style='color: #EFFFFF'>text4</span></div>
How can I get all the texts contained in all the <div> tags using XPath?
i.e.:
text1
text2
text3
text4
Use:
//div[@id='SNT']//text()
This selects every text node that is a descendant of any div element in the document whose id attribute has the string value "SNT".
If you want to exclude the whitespace-only text nodes from this selection, use:
//div[@id='SNT']//text()[normalize-space()]
This is similar to the first XPath expression, but now each selected text node must have an additional predicate satisfied -- that the value of the normalize-space() function upon its string contents is a non-empty string.
The value of the normalize-space() function is the empty string only when its argument is the empty string itself, or a string consisting only of whitespace characters (space, NL, CR and Tab).
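The effect of the [normalize-space()] predicate can be sketched in Python too. This assumes the standard library's xml.etree.ElementTree and a well-formed version of the markup wrapped in a body element; the extra spaces around the span in the third div are added here only to show the filter at work:

```python
import xml.etree.ElementTree as ET

html = (
    "<body>"
    "<div id='SNT'>text1</div>"
    "<div id='SNT'>text2</div>"
    "<div id='SNT'> <span style='color: #EFFFFF'>text3</span> </div>"
    "<div id='SNT'><span style='color: #EFFFFF'>text4</span></div>"
    "</body>"
)
root = ET.fromstring(html)

XML_WHITESPACE = " \t\r\n"  # the characters normalize-space() treats as whitespace

# All descendant text nodes of the matching divs (like //div[@id='SNT']//text()).
all_text = [t for div in root.findall(".//div[@id='SNT']") for t in div.itertext()]

# Keep only nodes that are non-empty after stripping whitespace
# (like adding the [normalize-space()] predicate).
texts = [t for t in all_text if t.strip(XML_WHITESPACE)]

print(all_text)  # ['text1', 'text2', ' ', 'text3', ' ', 'text4']
print(texts)     # ['text1', 'text2', 'text3', 'text4']
```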

XPath expression for selecting all text in a given node, and the text of its children

Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set will first be converted to its string-value. If no arguments are passed to normalize-space, it will use the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
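For readers who want to see what normalize-space() does to the string-value, here is a small sketch in Python. It assumes the standard library's xml.etree.ElementTree for parsing; normalize_space is a hand-written stand-in for the XPath function, not a library call:

```python
import re
import xml.etree.ElementTree as ET

def normalize_space(s):
    """Collapse runs of space, tab, CR and LF to one space, then trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

div = ET.fromstring("""<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>""")

raw = "".join(div.itertext())   # like string(//div[@id='theNode'])
clean = normalize_space(raw)    # like normalize-space(//div[@id='theNode'])

print(repr(raw))    # '\nThis is an example bolded text\n'
print(repr(clean))  # 'This is an example bolded text'
```

The regular expression lists the four whitespace characters explicitly rather than using str.split(), which would also split on other Unicode whitespace.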
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. For example, the following JavaScript can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace only text node between the span and b elements might be a problem.
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html") # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall() # Text nodes of the div and all its descendants
final_txt = ''.join(all_txt).strip() # Concatenate them and trim the surrounding whitespace
print(final_txt) # 'This is an example bolded text'
How about this:
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmm, I am not sure about the last part, though. You might have to experiment with it.
The basic expression
//div[@id='theNode']
gets all the text, but if the text nodes come back split, use
//div[@id='theNode']/text()
I'm not sure, but if you give me the link I will try it.
