How do I get all the text from an h1 tag up to the next h1 tag?
I have the class name of the starting h1 tag:
...
<h1 class="something">...</h1>
...
<h1 ...>...</h1>
...
I tried: //*[@class='something']//text()
I want to scrape the text from all children and siblings. I don't need the text of the h1 tags themselves. I don't know how to stop scraping at the next h1 tag.
With a proper example:
<root>
<h1 class="something">.1.</h1>
.2.
<p>.3.</p>
.4.
<h1 class="other">.5.</h1>
</root>
This XPath 1.0 expression:
/root//text()[not(ancestor::h1)][preceding::h1[1][@class='something']]
Meaning: "the descendant text nodes of the root element whose first preceding h1 element has a class attribute equal to 'something' and that have no h1 ancestor"
And it selects
.2.
.3.
.4.
Test in http://www.xpathtester.com/xpath/ecd4f379b13558572ffd62d0db3a3f98
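For anyone without a full XPath 1.0 engine at hand, here is a rough sketch of the same selection using only Python's standard library. ElementTree does not support the preceding:: axis or text() node tests, so the walk is done by hand over its text/tail model. The function name text_between_headings is my own, and whitespace-only text nodes are dropped, which the XPath expression itself would not do.

```python
import xml.etree.ElementTree as ET

XML = """<root>
<h1 class="something">.1.</h1>
.2.
<p>.3.</p>
.4.
<h1 class="other">.5.</h1>
</root>"""


def text_between_headings(root, start_class):
    """Collect text that follows the h1 with class `start_class`, up to the next h1."""
    collected = []
    capturing = False

    def visit(elem):
        nonlocal capturing
        for child in elem:
            if child.tag == "h1":
                # An h1 switches capture on or off; its own text is never collected.
                capturing = child.get("class") == start_class
            else:
                if capturing and child.text and child.text.strip():
                    collected.append(child.text.strip())
                visit(child)
            # The tail is the text that follows the child's closing tag.
            if capturing and child.tail and child.tail.strip():
                collected.append(child.tail.strip())

    visit(root)
    return collected


result = text_between_headings(ET.fromstring(XML), "something")
print(result)
```

With a library that implements full XPath 1.0 (lxml, for instance), the expression above could be passed to its xpath() method directly instead.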
Related
I'm trying to access a certain element using XPath, but I just can't seem to get it, and I don't quite understand why.
<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>
Basically, the text I want is the second one, but I want/need to use something similar to p[not(div)] to retrieve it.
I have tried the methods from the following link but to no avail (xpath find node that does not contain child)
Here is how I tried accessing the text:
ul[contains(@id,"content")]//p[not(.//div)]/text()
If you have any possible answers, thank you !
The HTML snippet posted in the question shows that neither p element contains a div, so the expression //p[not(.//div)] would match both of them. The first p element is a sibling of the div (both share the same parent element, li), not its parent or ancestor. The following XPath expression matches text nodes from the second p and not those from the first one:
//ul[contains(@id,"content")]/li[not(div)]/p/text()
Brief explanation:
//ul[contains(@id,"content")]: find ul elements whose id attribute value contains the text "content"
/li[not(div)]: from such ul, find child li elements that don't have a child div element. This matches only the last li in the example HTML
/p/text(): from such li, find child p elements and then return the child text nodes from such p
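The three steps above can be mirrored in plain Python as a quick check. This is only a sketch: it uses the standard library's ElementTree, which has no not() function, so the li[not(div)] test is written as an ordinary Python condition. The snippet from the question is parsed as XML, with the ul as the root element.

```python
import xml.etree.ElementTree as ET

HTML = """<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>"""

root = ET.fromstring(HTML)

texts = []
# root is the ul itself here; this check mirrors contains(@id, "content").
if "content" in root.get("id", ""):
    for li in root.findall("li"):
        # li[not(div)]: keep only li elements that have no div child.
        if li.find("div") is None:
            # /p/text(): take the text of each p child.
            texts.extend(p.text for p in li.findall("p") if p.text)

print(texts)
```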
Given html:
<div class="class1">
<div class="class2">1 2 3</div>
<div class="class3">a b c</div>
</div>
As I have several div elements in my HTML that use the class "class1", and none of them has an id, I want to find this parent element by the text of its children.
I tried different variants like
By.xpath("//div[contains(@class, 'class1') "
+ "and text()[contains(.,'1 2 3')] "
+ "and text()[contains(.,'a b c')]]"));
but nothing seems to work yet.
In the example above, I guess the text of the class1 element itself is checked, but not that of its children.
Can anybody help?
So you're looking for a div with class class1 that has children with texts 1 2 3 and a b c. From your example of what you've tried, I'm assuming there are no further conditions (eg class) on the children:
//div[@class='class1' and div/text()='1 2 3' and div/text()='a b c']
You can make those children node names into * if you don't care whether they are divs or not. You can make the children node names prefixed by descendant:: if you don't require them to be direct children.
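To see the idea in action outside Selenium, here is a small sketch with Python's standard-library ElementTree. It only supports a subset of XPath (no "and" inside a predicate), so the three conditions are written as chained predicates, which select the same element in this case. The surrounding body element and the second class1 div are made up for the test and are not part of the question.

```python
import xml.etree.ElementTree as ET

# The second class1 div is a hypothetical non-matching sibling added for contrast.
HTML = """<body>
<div class="class1">
<div class="class2">1 2 3</div>
<div class="class3">a b c</div>
</div>
<div class="class1">
<div class="class2">4 5 6</div>
</div>
</body>"""

root = ET.fromstring(HTML)

# [div='1 2 3'] keeps elements that have a div child whose full text equals '1 2 3'.
matches = root.findall(".//div[@class='class1'][div='1 2 3'][div='a b c']")

for parent in matches:
    print(parent.get("class"), [child.text for child in parent])
```

In Selenium, the original expression with "and" would be passed to By.xpath as a string, as in the question.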
Try either of the XPath expressions below.
Using the class attribute of the <div> tag:
//div[@class='class2']/..//div[@class='class3']/..//parent::div[@class='class1']
Explanation: first locate both child elements using the class attribute of the <div> tag, then move up with the parent axis to the <div> tag with the class attribute class1.
OR
Using the text() function on the <div> tag:
//div[text()= '1 2 3']/..//div[text()= 'a b c']/..//parent::div[@class='class1']
Explanation: first locate both child elements by their text, then move up with the parent axis to the <div> tag with the class attribute class1.
Both of the above XPath expressions will locate your parent element <div class="class1">
Given the following HTML:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the interim there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
I want all the nodes containing the word "interim" but not if the word "interim" is part of a link element.
The nodes I would expect back are the first P node and the LI node only.
I've tried the following:
'//*/text()[not(a) and contains(.,"interim")]'
... but this still returns the A and also returns part of its parent P node (the part after the A), neither of which are desired. You can see my attempt here: https://glot.io/snippets/ehp7hmmglm
If you use the XPath expression //*[not(self::a) and not(a) and text()[contains(.,"interim")]] then you get all elements that do not contain an a element, are not a elements and contain a text node child containing that word.
Consider this HTML:
<html>
<head>
</head>
<body>
<table>
<tr>
<td>
<h1>title</h1>
<h3>item 1</h3>
text details for item 1
<h3>item 2</h3>
text details for item 2
<h3>item 3</h3>
text details for item 3
</td>
</tr>
</table>
</body>
</html>
I'm not terribly familiar with XPath, but it seems to me that there is no notation which will match the "text details" sections individually. Can you confirm?
Use:
/html/body/table/tr/td/h3/following-sibling::text()[1]
This means: Get the first following sibling text node of every h3 element that is a child of every tr element that is a child of every table element that is a child of every body element that is a child of the html top element.
Or, if you only know that the wanted text nodes are the immediate following siblings of all h3 elements in the document, then this XPath expression selects them:
//h3/following-sibling::text()[1]
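As a quick way to try this without a full XPath engine, the same text nodes can be reached through Python's standard-library ElementTree, where the text that immediately follows an element's closing tag is stored in that element's tail attribute. This is a sketch rather than an exact equivalent: if an h3 were followed directly by another element, tail would be empty, whereas following-sibling::text()[1] would still find a later text sibling.

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<head>
</head>
<body>
<table>
<tr>
<td>
<h1>title</h1>
<h3>item 1</h3>
text details for item 1
<h3>item 2</h3>
text details for item 2
<h3>item 3</h3>
text details for item 3
</td>
</tr>
</table>
</body>
</html>"""

root = ET.fromstring(HTML)

# h3.tail holds the text node right after each </h3>.
details = [h3.tail.strip() for h3 in root.iter("h3") if h3.tail and h3.tail.strip()]

print(details)
```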
In the world of XML/XPath, text is a type of node (a text node, not an element node).
So, considering your example:
TD has 7 child nodes (ignoring whitespace-only text nodes)
TD.getChild(3) should return the "text details for item 1" value.
In XPath:
$x//table/tr/td/text()[1]
Basically, I want to select a node (div) whose child nodes (h1, b, h3) contain the specified text.
<html>
<div id="contents">
<p>
<h1> Child text 1</h1>
<b> Child text 2 </b>
...
</p>
<h3> Child text 3 </h3>
</div>
I am expecting /html/div/, not /html/div/h1.
I have the expression below, but unfortunately it returns the children instead of the div.
expression = "//div[contains(text(), 'Child text 1')]"
doc.xpath(expression)
So is there a way to do this simply with xpath syntax?
The following expression gives a node (div) in which any children nodes (not just h1,b,h3) contain specified text (not the div itself):
doc.xpath('//div[.//*[contains(text(), "Child text 1")]]')
You can refine that to return only the div with the id contents, as in your example:
doc.xpath('//div[@id="contents" and .//*[contains(text(), "Child text 1")]]')
It does not match, if the text is a text node of the div (directly inside the div), which is my interpretation of the question.
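Here is a rough standard-library Python sketch of the same check, in case it helps to see the logic spelled out. ElementTree cannot evaluate contains(), so the predicate is written by hand; the helper name has_descendant_with_text is my own. The markup is a well-formed version of the question's snippet (closing html tag added, the "..." line dropped), and the second div is a made-up non-matching case.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the question's markup; the "other" div is hypothetical.
HTML = """<html>
<div id="contents">
<p>
<h1> Child text 1</h1>
<b> Child text 2 </b>
</p>
<h3> Child text 3 </h3>
</div>
<div id="other">
<h1> Unrelated </h1>
</div>
</html>"""

root = ET.fromstring(HTML)


def has_descendant_with_text(elem, needle):
    """True if any descendant's first text node contains `needle`."""
    # elem.iter() yields elem itself first, so skip it to inspect descendants only.
    descendants = list(elem.iter())[1:]
    # d.text is the text before the first child, i.e. the first text node,
    # which is what contains(text(), ...) looks at in XPath 1.0.
    return any(needle in (d.text or "") for d in descendants)


matches = [
    div.get("id")
    for div in root.iter("div")
    if has_descendant_with_text(div, "Child text 1")
]

print(matches)
```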
You could append "/.." to anchor back to the parent. Not sure if there's a more robust method.
expression = "//div[contains(text(), 'Child text 1')]/.."