XPath to find text node that is a sibling of other nodes - xpath

Given the following fragment of html:
<fieldset>
<legend>My Legend</legend>
<p>Some text</p>
Text to capture
</fieldset>
Is there an xpath expression that will return only the 'Text to capture' text node?
Trying /fieldset/text() yields three nodes, not just the one I need.

Assuming what you want is the text node containing non-whitespace text:
//fieldset/text()[normalize-space(.)]
If what you want is the last text node, then:
//fieldset/text()[last()]

I recommend you accept Steven D. Majewski's answer, but here is the explanation (text nodes highlighted with square brackets):
<fieldset>[
]<legend>My Legend</legend>[
]<p>Some text</p>[
Text to capture
]</fieldset>
so /fieldset/text() returns
"\n "
"\n "
"\n Text to capture\n"
And this is why you want /fieldset/text()[normalize-space()], and you want the result trimmed before use.
Also note that the above is short for /fieldset/text()[normalize-space(.) != '']. When normalize-space() returns a non-empty string, the predicate evaluates to true, while the empty string evaluates to false.
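To see the three text nodes for yourself without an XPath engine, here is a minimal sketch using Python's standard xml.dom.minidom. It only imitates /fieldset/text() and the [normalize-space()] predicate by walking child nodes. The variable names are illustrative, and the fragment is assumed to be well-formed XML.

```python
from xml.dom import minidom

FRAGMENT = """<fieldset>
<legend>My Legend</legend>
<p>Some text</p>
Text to capture
</fieldset>"""

fieldset = minidom.parseString(FRAGMENT).documentElement

# Rough equivalent of /fieldset/text(): text nodes that are direct children.
text_children = [n.data for n in fieldset.childNodes if n.nodeType == n.TEXT_NODE]

# Rough equivalent of the [normalize-space()] predicate: drop whitespace-only nodes.
non_blank = [t for t in text_children if t.strip(" \t\r\n")]

# The selected node still carries its surrounding newlines, so trim before use.
captured = [t.strip(" \t\r\n") for t in non_blank]

print(len(text_children))  # number of text-node children
print(captured)
```

Only the last of the three children survives the filter, which is why the predicate is needed and why the result should be trimmed afterwards.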

Related

Xpath getting text with mixed elements in same div

Here is some sample HTML
<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements
<p> Next paragraph....
</div>
what I tried was
//div[contains('@class','something')/text()
and
//div[contains('@class','something')/*/text()
and
//div[contains('@class','something')/p/text()
all of these seem to skip the 'b' tags and the 'a' tags.
Try " ".join(sel.xpath("//div[contains(@class,'something')]//text()").extract()), where sel is a selector; in your case it may be response.
Use the XPath expression
//div[contains(@class,'something')]//text()
to select all the text nodes that are descendants of the chosen div element; concatenating them gives the text of the div.
Output:
This is a Paragraph with mixed elements
Next paragraph....
It depends on what you want to obtain and how. Anyway, there are a couple of problems with what you tried:
You are missing the closing bracket (]) after contains in the XPath expression.
@class should not be enclosed in (single) quotes when used inside contains.
If you want to get all the text of the div element as one string, you might use
normalize-space(//div[contains(@class,'something')])

XPath get texts inside a "might existed" tag

I have a HTML which contains some tags like below:
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT"><span style='color: #EFFFFF'>text3</span></div>
<div id="SNT"><span style='color: #EFFFFF'>text4</span></div>
how can I get all the texts included in all <div> tags using XPath?
i.e.:
text1
text2
text3
text4
Use:
//div[@id='SNT']//text()
This selects any text node that is a descendant of any div element in the XML document that has an id attribute whose string value is "SNT".
If you want to exclude the whitespace-only text nodes from this selection, use:
//div[@id='SNT']//text()[normalize-space()]
This is similar to the first XPath expression, but now each selected text node must satisfy an additional predicate: the value of the normalize-space() function applied to its string contents must be a non-empty string.
The value of the normalize-space() function is the empty string only when its argument is the empty string itself, or a string consisting only of whitespace characters (space, NL, CR and Tab).
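The behaviour described above can be approximated in a few lines of Python. This is only a sketch of the XPath 1.0 semantics, not an XPath implementation; normalize_space and predicate_is_true are hypothetical helper names introduced here for illustration.

```python
import re

XPATH_WS = " \t\r\n"  # the four whitespace characters XPath 1.0 recognises


def normalize_space(value: str) -> str:
    """Approximate normalize-space(): trim, then collapse whitespace runs."""
    return re.sub(r"[ \t\r\n]+", " ", value.strip(XPATH_WS))


def predicate_is_true(value: str) -> bool:
    # A string used as a predicate is true exactly when it is non-empty.
    return normalize_space(value) != ""


print(repr(normalize_space("  \n text3 \t ")))  # 'text3'
print(predicate_is_true("\n   \t"))             # False
print(predicate_is_true("text1"))               # True
```

The explicit character class is used instead of str.split() because Python treats more characters as whitespace than XPath does.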

XPath expression for selecting all text in a given node, and the text of its children

Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for an hour or more with no result.
Any help is appreciated
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[@id='theNode'])
You can also use the normalize-space() function to reduce unwanted whitespace that might appear due to newlines and indentation in the source document. It removes leading and trailing whitespace and replaces each sequence of whitespace characters with a single space. When you pass a node-set to normalize-space(), the node-set is first converted to its string-value (that of its first node in document order). If no argument is passed, normalize-space() uses the context node.
normalize-space(//div[@id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want to use a more efficient way of selecting the context node than the example XPath I have been using. For example, the following JavaScript can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null, XPathResult.STRING_TYPE, null).stringValue;
The whitespace-only text node between the span and b elements might be a problem.
Use:
string(//div[@id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[@id='theNode']))
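As a quick cross-check outside XPath, the same two values can be approximated with Python's standard xml.etree.ElementTree. This is a sketch under the assumption that the input is well-formed XML; itertext() plays the role of the string value, and a split/join stands in for normalize-space() (adequate here because the sample contains only spaces and newlines).

```python
import xml.etree.ElementTree as ET

XML = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""

div = ET.fromstring(XML)

# String value: all descendant text, concatenated in document order.
string_value = "".join(div.itertext())

# Stand-in for normalize-space(): trim and collapse whitespace runs.
normalized = " ".join(string_value.split())

print(repr(string_value))
print(normalized)
```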
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[@id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[@id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using Scrapy in Python, you can use descendant-or-self::*/text(). Full example:
import scrapy

txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html")  # Build a selector from the HTML text
# Collect the text nodes of the div and of all its descendants, in document order
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join(all_txt).strip()
print(final_txt)  # This is an example bolded text
How about this:
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmm, I am not sure about the last part though. You might have to play with that.
The normal code is
//div[@id='theNode']
to get all the text, but if the text nodes come back split, then try
//div[@id='theNode']/text()
Not sure, but if you provide me the link I will try.

XPath find text in any text node

I am trying to find a certain text in any text node in a document, so far my statement looks like this:
doc.xpath("//text() = 'Alliance Consulting'") do |node|
...
end
This obviously does not work, can anyone suggest a better alternative?
The expression //text() = 'Alliance Consulting' evaluates to a boolean.
In case of this test sample:
<r>
<t>Alliance Consulting</t>
<s>
<p>Test string
<f>Alliance Consulting</f>
</p>
</s>
<z>
Alliance Consulting
<y>
Other string
</y>
</z>
</r>
It will return true of course.
The expression you need should evaluate to a node-set, so use:
//text()[. = 'Alliance Consulting']
E.g. expression:
count(//text()[normalize-space() = 'Alliance Consulting'])
against the above document will return 3.
To select text nodes which contain 'Alliance Consulting' in the whole string value (e.g. 'Alliance Consulting provides great services') use:
//text()[contains(.,'Alliance Consulting')]
Note that adjacent text nodes should be merged into a single node once the parser has processed the document.
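The difference between the exact, normalized and contains() comparisons can be checked with a small stdlib sketch. It imitates //text() by walking the DOM with xml.dom.minidom rather than running XPath; text_nodes is a hypothetical helper written for this example.

```python
from xml.dom import minidom

XML = """<r>
<t>Alliance Consulting</t>
<s>
<p>Test string
<f>Alliance Consulting</f>
</p>
</s>
<z>
Alliance Consulting
<y>
Other string
</y>
</z>
</r>"""


def text_nodes(node):
    """Yield every text node below `node` in document order (like //text())."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


target = "Alliance Consulting"
all_text = list(text_nodes(minidom.parseString(XML)))

# //text()[. = 'Alliance Consulting']
exact = [n for n in all_text if n.data == target]
# //text()[normalize-space() = 'Alliance Consulting']
normalized = [n for n in all_text if " ".join(n.data.split()) == target]
# //text()[contains(., 'Alliance Consulting')]
containing = [n for n in all_text if target in n.data]

print(len(exact), len(normalized), len(containing))
```

The exact comparison misses the text node inside z because that node also contains the surrounding newlines.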

XPath / XQuery: find text in a node, but ignoring content of specific descendant elements

I am trying to find a way to search for a string within nodes, but excluding the content of some subelements of those nodes. Plain and simple, I want to search for a string in paragraphs of a text, excluding the footnotes which are children elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
"I want to search for a string in paragraphs of a text, excluding the footnotes which are children elements of the paragraphs"
An XPath 1.0-only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (obtained from yours, but with a p added inside the footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
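The logic of that predicate can be mirrored in plain Python to make the ancestor test explicit. This is a sketch using xml.dom.minidom, not an XPath evaluation; text_nodes and has_ancestor are hypothetical helpers introduced for illustration.

```python
from xml.dom import minidom

XML = """<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>"""


def text_nodes(node):
    """Yield every text node below `node` in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def has_ancestor(node, tag):
    """Return True if any ancestor element of `node` is named `tag`."""
    parent = node.parentNode
    while parent is not None:
        if parent.nodeType == parent.ELEMENT_NODE and parent.tagName == tag:
            return True
        parent = parent.parentNode
    return False


doc = minidom.parseString(XML)

# Mirrors //p//text()[not(ancestor::footnote) and contains(., 'text')]
matches = [
    n.data
    for n in text_nodes(doc)
    if has_ancestor(n, "p")
    and not has_ancestor(n, "footnote")
    and "text" in n.data
]

print(matches)
```

The p nested inside the footnote is rejected by the ancestor check even though it is itself a p containing the word.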
With XPath 2.0 (or XQuery), the except operator can be used:
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]

Resources