I am looking for a way to remove part of a string value that I get from a webpage with an XPath function.
I have this:
<div id="article_body" class="">
This my wonderful sentence, however here the string i dont want :
<br><br>
<div class="typo">Found a typo in the article? Click here.
</div>
</div>
So in the end I would like to have:
This my wonderful sentence, however here the string i dont want :
I get the text with
//*[@id="article_body"]
Then I try to use replace:
//replace('*[@id="article_body"]','Found a typo in the article? ', )
But it doesn't work, so I think it's because I'm a newbie with XPath...
How can I do that please?
It appears that you are getting the computed string value of the selected div element.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
If you don't want to include the text of the descendant nodes, and only want the text() nodes that are immediate children of the div, then adjust your XPath:
//*[@id="article_body"]/text()
Otherwise, you could use substring-before():
substring-before(//*[@id="article_body"], 'Found a typo in the article?')
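To make the difference between the element's string value and its direct text children concrete, here is a minimal sketch using only Python's standard library. xml.etree.ElementTree does not evaluate XPath functions such as substring-before(), so the sketch mirrors them with plain string operations. The sample markup is a well-formed copy of the question's snippet (<br/> instead of <br>), which is an assumption about the real page.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's markup (assumption: <br/> instead of <br>).
SAMPLE = """<div id="article_body" class="">
This my wonderful sentence, however here the string i dont want :
<br/><br/>
<div class="typo">Found a typo in the article? Click here.
</div>
</div>"""

root = ET.fromstring(SAMPLE)

# String value of the element: every descendant text node, in document order.
string_value = "".join(root.itertext())

# Rough equivalent of substring-before(., 'Found a typo in the article?').
# str.partition returns the whole string when the marker is missing, so the
# separator is checked to mimic XPath, which returns '' in that case.
before, sep, _ = string_value.partition("Found a typo in the article?")
wanted = before.strip() if sep else ""

# Rough equivalent of ./text(): only the text nodes that are direct children
# of the outer div (its leading text plus the tails of its child elements).
direct_text = (root.text or "") + "".join(child.tail or "" for child in root)

print(wanted)
print(direct_text.strip())
```

Both prints show only the sentence from the outer div, while string_value still contains the "Click here." text from the nested div.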
Related
I have the following code :
<div class = "content">
<table id="detailsTable">...</table>
<div class = "desc">
<p>Some text</p>
</div>
<p>Another text</p>
</div>
I want to select all the text within the 'content' class, which I would get using this XPath:
doc.xpath('string(//div[@class="content"])')
The problem is that it selects all the text including text within the 'table' tag. I need to exclude the 'table' from the xPath. How would I achieve that?
XPath 1.0 solutions :
substring-after(string(//div[@class="content"]),string(//div[@class="content"]/table))
Or just use concat :
concat(//table/following::p[1]," ",//table/following::p[2])
The XPath expression //div[@class="content"] selects the div element, nothing more and nothing less, and applying the string() function gives you the string value of the element, which is the concatenation of all its descendant text nodes.
Getting all the text except the text contained in one particular child is probably not possible in XPath 1.0. With XPath 2.0 it can be done as
string-join(//div[@class="content"]/(node() except table)//text(), '')
But for this kind of manipulation, you're really in the realm of transformation rather than pure selection, so you're stretching the limits of what XPath is designed for.
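If the selection ends up in a host language anyway, the "everything except the table" step can be done there. The following is a minimal sketch with Python's standard library, not a translation of the XPath 2.0 expression above: the helper name text_without is hypothetical, the table content ("Table text") is a made-up placeholder for the elided "...", and the markup is assumed to be well formed.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the question's markup; the table content is a
# hypothetical placeholder because the original elides it.
SAMPLE = """<div class="content">
<table id="detailsTable"><tr><td>Table text</td></tr></table>
<div class="desc">
<p>Some text</p>
</div>
<p>Another text</p>
</div>"""


def text_without(element, skip_tag):
    """Concatenate the text of `element`, skipping child elements named `skip_tag`."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip_tag:
            parts.append("".join(child.itertext()))
        # The tail is text that follows the child, so it belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


content = ET.fromstring(SAMPLE)
# Collapse whitespace so the result is easy to read.
result = " ".join(text_without(content, "table").split())
print(result)
```

Only direct children named table are skipped here; a table nested deeper inside another child would still contribute its text.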
I need to extract all children which have nodes with some text. Html structure might be the following:
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>
I need to extract child nodes which have "A" text. It should return div and p nodes
I tried the following xpaths:
./*/*[contains(text(), 'A')]
./*/*[./*[contains(text(), 'A')]]
but the first one returns only div with "A" text and the second one returns only p with "A" text
Is it possible to construct xpath which will return both children?
Node containing "A" text might be at any level in the child node
If you need XPath that returns both child nodes, try to use
./*/*[contains(., "A")]
I suspect contains() is wrong here, unless you really want to select a node whose value is "HAT" as well as one whose value is "A".
Try
*/*[normalize-space(.)='A']
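The difference between the two predicates can be checked outside XPath. This is a minimal sketch with Python's standard library that mimics contains(., 'A') and normalize-space(.)='A' on the string value of each child; the extra <i>HAT</i> element is a hypothetical addition to the question's markup, included only to show the false positive.

```python
import xml.etree.ElementTree as ET

# The question's markup plus a hypothetical <i>HAT</i> child.
SAMPLE = """<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
<i>HAT</i>
</div>"""


def string_value(element):
    """String value of an element: all descendant text nodes concatenated."""
    return "".join(element.itertext())


root = ET.fromstring(SAMPLE)

# Mimics *[contains(., 'A')]: substring test on the string value.
contains_a = [c.tag for c in root if "A" in string_value(c)]

# Roughly mimics *[normalize-space(.) = 'A']: exact match after trimming
# and collapsing whitespace.
equals_a = [c.tag for c in root if " ".join(string_value(c).split()) == "A"]

print(contains_a)
print(equals_a)
```

The substring test also picks up the i element, while the exact match returns only the div and the p.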
Here is some sample HTML
<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements
<p> Next paragraph....
</div>
what I tried was
//div[contains('@class','something')/text()
and
//div[contains('@class','something')/*/text()
and
//div[contains('@class','something')/p/text()
all of these seem to skip the 'b' tags and the 'a' tags.
Try " ".join(sel.xpath("//div[contains(@class,'something')]//text()").extract()), where sel is your selector (in your case it may be response).
Use the XPath expression
//div[contains(@class,'something')]//text()
to select all the descendant text() nodes of the chosen div element; concatenated, they give the full text.
Output:
This is a Paragraph with mixed elements
Next paragraph....
It depends on what you want to obtain and how. Anyway, there are a couple of problems with what you tried:
You are missing the closing bracket (]) after contains(...) in the XPath expression.
@class should not be enclosed in (single) quotes when used inside contains.
If you want to get all the text of the div element as one string, you might use
normalize-space(//div[contains(@class,'something')])
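To show what normalize-space() does to the string value of the div, here is a minimal sketch with Python's standard library. The normalize_space helper is a hypothetical stand-in for the XPath function, and the markup is a well-formed copy of the sample (the <p> elements are closed), which is an assumption, since the standard-library parser rejects the original HTML.

```python
import re
import xml.etree.ElementTree as ET

# Well-formed copy of the sample HTML (assumption: closing </p> tags added).
SAMPLE = """<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements</p>
<p> Next paragraph....</p>
</div>"""


def normalize_space(value):
    """Collapse runs of XPath whitespace (space, tab, CR, LF) and trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")


div = ET.fromstring(SAMPLE)

# String value of the div: all descendant text nodes, including the text
# inside <b>, concatenated in document order.
raw = "".join(div.itertext())
normalized = normalize_space(raw)
print(normalized)
```

The text inside the b element is included because the string value covers every descendant text node, not only the direct children.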
I need to retrieve a div without children with given text. I have this html
<h1>Rest Object</h1>
<div style="background-color: transparent;">
<div>Title: Rest object</div>
<div>ID: 2</div>
<div>Title: Rest object Copy</div>
<div>Full text: This is the full text. ID: 2</div>
<div>Value: 0.564</div>
<div>Timestamp: 2017-06-14 11:35:40</div>
</div>
I want to find <div>ID: 2</div>. How? I tried
xpath=(//div)
and it returns first div. I tried to use
xpath=(//div[not(div)])
and it returns
<div>Title: Rest object</div>.
UPDATE. Now I know I could use an index:
xpath=(//div[not(div)][2])
which returns
<div>ID: 2</div>.
What if I don't know the index?
One way to get the needed div is to use the starts-with() function:
//div[starts-with(.,'ID:')]
boolean starts-with(string, string) - Returns true if the first argument string starts with the second argument string; otherwise returns false
To restrict the search to div elements that have no child elements, you may use the count(*) function:
//div[starts-with(.,'ID:')][count(*)=0]
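The same two conditions can be checked in plain Python as a sanity test. This is a minimal sketch with the standard library: it mimics starts-with(., 'ID:') and count(*)=0 with str.startswith and len(), and wraps the question's markup in a hypothetical <body> element so that it has a single root.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <body> root element.
SAMPLE = """<body>
<h1>Rest Object</h1>
<div style="background-color: transparent;">
<div>Title: Rest object</div>
<div>ID: 2</div>
<div>Title: Rest object Copy</div>
<div>Full text: This is the full text. ID: 2</div>
<div>Value: 0.564</div>
<div>Timestamp: 2017-06-14 11:35:40</div>
</div>
</body>"""

root = ET.fromstring(SAMPLE)

matches = [
    div
    for div in root.iter("div")
    # len(div) == 0 mimics count(*) = 0: the element has no child elements.
    # startswith mimics starts-with(., 'ID:') on the element's string value.
    if len(div) == 0 and "".join(div.itertext()).startswith("ID:")
]

print([ET.tostring(m, encoding="unicode").strip() for m in matches])
```

The "Full text: ... ID: 2" div is not matched because its text contains "ID:" but does not start with it.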
I have an element that looks like this:
<div class="unique class">
<i class="generic class">text1</i>
text2
</div>
Is there a good way to select text2 only? I may add that text2 always starts with "Following".
This is one possible XPath :
//div[@class='unique class']/text()[starts-with(normalize-space(),'Following')]
Brief explanation:
//div[@class='unique class'] : find the outer div by its class
/text()[starts-with(normalize-space(),'Following')] : find a text node that is a direct child of the previously found div and that, after normalizing spaces, starts with the text "Following".
An alternative that doesn't consider the target text node's content would be:
//div[@class='unique class']/text()[normalize-space()]
The last bit (/text()[normalize-space()]) returns the non-empty text nodes that are direct children of the outer div.
You can simply use text() on the parent div
//div[@class='unique class']/text()[2]
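For comparison, here is a minimal sketch of the same idea with Python's standard library, where a text node that follows a child element is exposed as that child's tail. The text "Following text2" is a hypothetical stand-in for text2, chosen because the question says it always starts with "Following".

```python
import xml.etree.ElementTree as ET

# The question's markup with a hypothetical value for text2.
SAMPLE = """<div class="unique class">
<i class="generic class">text1</i>
Following text2
</div>"""

div = ET.fromstring(SAMPLE)

# Direct text-node children of the div: its leading text, then the tail
# that follows each child element.
text_nodes = [div.text or ""] + [child.tail or "" for child in div]

# Mimics text()[normalize-space()]: keep only non-blank text nodes.
non_empty = [t.strip() for t in text_nodes if t.strip()]

# Mimics text()[starts-with(normalize-space(), 'Following')].
following = [t for t in non_empty if t.startswith("Following")]

print(non_empty)
print(following)
```

The first text node is whitespace only, which is why the positional variant needs text()[2] and the normalize-space() variants do not need an index.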