XPath get texts inside a "might existed" tag - xpath

I have an HTML document that contains tags like these:
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT"><span style='color: #EFFFFF'>text3</span></div>
<div id="SNT"><span style='color: #EFFFFF'>text4</span></div>
How can I get all the text contained in these <div> tags using XPath?
i.e.:
text1
text2
text3
text4

Use:
//div[@id='SNT']//text()
This selects every text node that is a descendant of any div element in the document whose id attribute has the string value "SNT".
If you want to exclude the whitespace-only text nodes from this selection, use:
//div[@id='SNT']//text()[normalize-space()]
This is similar to the first XPath expression, but now each selected text node must have an additional predicate satisfied -- that the value of the normalize-space() function upon its string contents is a non-empty string.
The value of the normalize-space() function is the empty string only when its argument is the empty string itself, or a string comprised of whitespace-only characters (space, NL, CR and Tab).
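As a rough illustration of the two expressions above, here is a minimal sketch using Python's lxml library (an assumption, since the question does not name a tool). The sample is wrapped in a <body> element and one div is spread over several lines so that whitespace-only text nodes actually occur; both changes are made up for the demonstration.

```python
from lxml import etree

# Sample from the question, wrapped in <body> so it is well-formed XML.
# The third div is split across lines (an assumption) to create
# whitespace-only text nodes.
SAMPLE = """<body>
<div id="SNT">text1</div>
<div id="SNT">text2</div>
<div id="SNT">
  <span style='color: #EFFFFF'>text3</span>
</div>
<div id="SNT"><span style='color: #EFFFFF'>text4</span></div>
</body>"""

doc = etree.fromstring(SAMPLE)

# Every descendant text node, including the whitespace-only ones.
all_text = doc.xpath("//div[@id='SNT']//text()")

# Only text nodes whose normalized value is non-empty.
non_blank = doc.xpath("//div[@id='SNT']//text()[normalize-space()]")

print(len(all_text))  # 6
print(non_blank)      # ['text1', 'text2', 'text3', 'text4']
```

The XML parser is used here because it keeps every whitespace text node; an HTML parser may treat such whitespace differently.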

Related

replace full string in xpath just get before

I am looking for a way to remove part of a string value obtained from a web page with an XPath function.
I have this:
<div id="article_body" class="">
This my wonderful sentence, however here the string i dont want :
<br><br>
<div class="typo">Found a typo in the article? Click here.
</div>
</div>
So at the end I would have
This my wonderful sentence, however here the string i dont want :
I get the text with
//*[@id="article_body"]
Then I try to use replace:
//replace('*[@id="article_body"]','Found a typo in the article? ', )
But it doesn't work, so I think it's because I'm a newbie with XPath...
How can I do that please?
It appears that you are getting the computed string value of the selected div element.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
If you don't want the text from descendant elements, and only want the text nodes that are immediate children of the div, then adjust your XPath:
//*[@id="article_body"]/text()
Otherwise, you could use substring-before():
substring-before(//*[@id="article_body"], 'Found a typo in the article?')
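A minimal sketch of both options, assuming Python with lxml (the question does not say which XPath engine is in use). The <br> tags are written as <br/> so the sample parses as XML, and the results are trimmed in Python because the surrounding newlines are part of the text nodes.

```python
from lxml import etree

# Sample from the question; <br> is self-closed so the XML parser accepts it.
SAMPLE = """<div id="article_body" class="">
This my wonderful sentence, however here the string i dont want :
<br/><br/>
<div class="typo">Found a typo in the article? Click here.
</div>
</div>"""

doc = etree.fromstring(SAMPLE)

# Option 1: only the text nodes that are direct children of the div.
direct = doc.xpath('//*[@id="article_body"]/text()')
direct_text = "".join(direct).strip()

# Option 2: cut the string value of the div at the unwanted sentence.
before = doc.xpath(
    "substring-before(//*[@id='article_body'], 'Found a typo in the article?')"
).strip()

print(direct_text)
print(before)
```

Both variables end up holding the sentence without the typo notice. Option 1 would also keep any direct text that follows the inner div, while option 2 drops everything after the marker string.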

Xpath getting text with mixed elements in same div

Here is some sample HTML
<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements
<p> Next paragraph....
</div>
what I tried was
//div[contains('@class','something')/text()
and
//div[contains('@class','something')/*/text()
and
//div[contains('@class','something')/p/text()
all of these seem to skip the 'b' tags and the 'a' tags.
Try " ".join(sel.xpath("//div[contains(@class,'something')]//text()").extract()), where sel is your selector (in your case it may be response).
Use the XPath expression
//div[contains(@class,'something')]//text()
to select all the text nodes inside the chosen div element, at any depth; joining them gives the concatenated text.
Output:
This is a Paragraph with mixed elements
Next paragraph....
It depends on what you want to obtain and how. In any case, there are a couple of problems with what you tried:
You are missing the closing bracket (]) of the predicate, after the contains(...) call.
@class should not be enclosed in (single) quotes when used inside contains; quoted, it is just a string literal.
If you want to get all the text of the div element as one string, you might use
normalize-space(//div[contains(@class,'something')])
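The two approaches suggested above can be compared with a short sketch. It assumes Python with lxml rather than the Scrapy selector mentioned in the answer, and the <p> tags are closed explicitly so the sample parses as XML.

```python
from lxml import etree

# Sample from the question with the <p> elements closed (an assumption
# needed for the XML parser; HTML parsers close them implicitly).
SAMPLE = """<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements</p>
<p> Next paragraph....</p>
</div>"""

doc = etree.fromstring(SAMPLE)

# All descendant text nodes, including the one inside <b>.
nodes = doc.xpath("//div[contains(@class,'something')]//text()")

# Join the non-blank pieces in Python.
joined = " ".join(t.strip() for t in nodes if t.strip())

# Or let XPath collapse the whole string value of the div in one call.
one_string = doc.xpath("normalize-space(//div[contains(@class,'something')])")

print(joined)
print(one_string)
```

With this sample both prints show "This is a Paragraph with mixed elements Next paragraph....". The first form keeps access to the individual text nodes; the second returns a single string directly.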

XPath expression: selecting text nodes between element nodes

Based in the following HTML I want to extract TextA, TextC and TextE.
<div id='content'>
TextA
<br/>
<br/>
<p>TextB</p>
TextC
<br/>
TextC
<p>TextD</p>
TextE
</div>
I tried to get TextC like so but I don't get the result I want:
Query:
//*[preceding::p[contains(.,"TextB")] and following::p[contains(.,"TextD")]]
Expected result:
["TextC", <br/>, "TextC"]
Actual result:
[<br/>]
Is there a way to select the text nodes without using indexes like //div/text()[1]?
The two text nodes aren't in the result of your XPath because * only matches elements. To match both element and text nodes, you can use node() instead:
//node()[preceding::p[contains(.,"TextB")] and following::p[contains(.,"TextD")]]
Or, if you want only the text nodes (that is, excluding <br/>), you can use text() instead of node():
//text()[preceding::p[contains(.,"TextB")] and following::p[contains(.,"TextD")]]
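To see the difference between *, node() and text() in the predicate above, here is a small sketch that assumes Python with lxml. The variable names are made up for the example.

```python
from lxml import etree

SAMPLE = """<div id='content'>
TextA
<br/>
<br/>
<p>TextB</p>
TextC
<br/>
TextC
<p>TextD</p>
TextE
</div>"""

doc = etree.fromstring(SAMPLE)

# Shared predicate: after the <p> with TextB and before the <p> with TextD.
PRED = '[preceding::p[contains(.,"TextB")] and following::p[contains(.,"TextD")]]'

elements = doc.xpath("//*" + PRED)      # elements only: the <br/>
nodes = doc.xpath("//node()" + PRED)    # text, <br/>, text
texts = doc.xpath("//text()" + PRED)    # the two text nodes only

print([e.tag for e in elements])        # ['br']
print(len(nodes))                       # 3
print([t.strip() for t in texts])       # ['TextC', 'TextC']
```

The text nodes carry their surrounding newlines, so they are stripped before printing.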

XPath to find text node that is a sibling of other nodes

Given the following fragment of html:
<fieldset>
<legend>My Legend</legend>
<p>Some text</p>
Text to capture
</fieldset>
Is there an xpath expression that will return only the 'Text to capture' text node?
Trying /fieldset/text() yields three nodes, not just the one I need.
Assuming you want the text node that contains non-whitespace text:
//fieldset/text()[normalize-space(.)]
If what you want is the last text node, then:
//fieldset/text()[last()]
I recommend you accept Steven D. Majewski's answer, but here is the explanation (text nodes highlighted with square brackets):
<fieldset>[
]<legend>My Legend</legend>[
]<p>Some text</p>[
Text to capture
]</fieldset>
so /fieldset/text() returns
"\n "
"\n "
"\n Text to capture\n"
And this is why you want /fieldset/text()[normalize-space()], and you want the result trimmed before use.
Also note that the above is short for /fieldset/text()[normalize-space(.) != '']. When normalize-space() returns a non-empty string, the predicate evaluates to true, while the empty string evaluates to false.
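The explanation above can be checked with a short sketch, assuming Python with lxml; the two-space indentation in the sample is an assumption about the original markup.

```python
from lxml import etree

SAMPLE = """<fieldset>
  <legend>My Legend</legend>
  <p>Some text</p>
  Text to capture
</fieldset>"""

doc = etree.fromstring(SAMPLE)

# All direct text children: two whitespace-only nodes plus the wanted one.
all_nodes = doc.xpath("/fieldset/text()")

# Keep only nodes whose normalized value is non-empty.
kept = doc.xpath("/fieldset/text()[normalize-space()]")

# Alternative from the first answer: take the last text node.
last = doc.xpath("/fieldset/text()[last()]")

print(len(all_nodes))      # 3
print(repr(kept[0]))       # still has leading and trailing whitespace
print(kept[0].strip())     # Text to capture
```

The predicate only filters nodes; it does not trim them, which is why the final strip() is still needed.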

select parent node containing text inside children's node

Basically, I want to select a node (div) whose child nodes (h1, b, h3) contain the specified text.
<html>
<div id="contents">
<p>
<h1> Child text 1</h1>
<b> Child text 2 </b>
...
</p>
<h3> Child text 3 </h3>
</div>
I am expecting /html/div/, not /html/div/h1.
I have the expression below, but unfortunately it returns the children instead of the div.
expression = "//div[contains(text(), 'Child text 1')]"
doc.xpath(expression)
So is there a way to do this simply with xpath syntax?
The following expression selects a node (div) in which any descendant element (not just h1, b, h3) contains the specified text (text directly inside the div itself does not count):
doc.xpath('//div[.//*[contains(text(), "Child text 1")]]')
You can refine that and return only the div with the id contents, as in your example:
doc.xpath('//div[@id="contents" and .//*[contains(text(), "Child text 1")]]')
It does not match, if the text is a text node of the div (directly inside the div), which is my interpretation of the question.
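A minimal sketch of the `.//*` predicate, assuming doc is an lxml tree (the question's doc.xpath call suggests lxml, but that is a guess). The sample is reduced to well-formed XML and the "..." placeholder is left out.

```python
from lxml import etree

# Reduced, well-formed version of the sample from the question.
SAMPLE = (
    "<html>"
    '<div id="contents">'
    "<p><h1> Child text 1</h1><b> Child text 2 </b></p>"
    "<h3> Child text 3 </h3>"
    "</div>"
    "</html>"
)

doc = etree.fromstring(SAMPLE)

# The div is selected because one of its descendants holds the text.
matches = doc.xpath('//div[.//*[contains(text(), "Child text 1")]]')

# The original attempt only tests text that sits directly inside the div.
direct = doc.xpath("//div[contains(text(), 'Child text 1')]")

print([m.get("id") for m in matches])  # ['contents']
print(direct)                          # []
```

The predicate tests the descendants but the step itself still returns the div, so no "/.." is needed to climb back up.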
You could append "/.." to anchor back to the parent. Not sure if there's a more robust method.
expression = "//div[contains(text(), 'Child text 1')]/.."
