<body>
<div>some text</div>
I NEED THIS TEXT ONLY
<div>some text</div>
more text here
<div>some text</div>
one more text here
<div>some text</div>
</body>
How?
Use:
/*/div[1]/following-sibling::text()[1]
This selects the first text-node sibling of the first div child of the top element of the document.
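For anyone without an XPath engine at hand, here is a minimal sketch of the same idea in Python. It swaps XPath for the standard library's xml.etree.ElementTree, which has no text() node test but exposes the text that follows an element as its tail. The names SAMPLE and first_text_after_first_div are made up for illustration, and it assumes the markup is well-formed XML.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<div>some text</div>
I NEED THIS TEXT ONLY
<div>some text</div>
more text here
<div>some text</div>
one more text here
<div>some text</div>
</body>"""


def first_text_after_first_div(markup):
    """Return the text directly following the first <div> child of the root."""
    body = ET.fromstring(markup)      # root element is <body>
    first_div = body.find("div")      # first <div> child in document order
    if first_div is None:
        return ""
    # ElementTree stores the text node after an element in its .tail
    return (first_div.tail or "").strip()


result = first_text_after_first_div(SAMPLE)
print(result)
```

The .tail lookup plays the role of following-sibling::text()[1] here; the strip() only trims the surrounding newlines.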
This returns the first text node within body that sits between two div elements (it uses the XPath 2.0 element() test):
/body/text()[
./preceding::element()[1][local-name()="div"] and
./following::element()[1][local-name()="div"]
][1]
should return
I NEED THIS TEXT ONLY
The same in XPath 1.0:
/body/text()[preceding-sibling::*[1][self::div]]
[following-sibling::*[1][self::div]][1]
Also:
/body/text()[normalize-space()][1]
I don't have nokogiri, but here's an alternative using just basic string manipulation.
html=<<EOF
<body>
<div>some text</div>
I NEED THIS TEXT ONLY
<div>some text</div>
more text here
<div>some text</div>
one more text here
<div>some text</div>
</body>
EOF
p html.split(/<\/*body>/)[1].split(/<\/div>/)[1].split(/<div>/)[0]
Related
I have some HTML from which I want to extract text content using Python + lxml
<html>
<body>
<p>Some text I DON'T want</p>
<div class="container">
<p>Some text I DO want</p>
<span>
A link I DO want
</span>
</div>
</body>
</html>
Couple of conditions -
I only want text nested under a specific root div[@class='container']
I want all nested text under that root
So -
if __name__=="__main__":
    import lxml.html
    doc=lxml.html.fromstring(HTML)
    root=doc.xpath("//div[@class='container']").pop()
    for xpath in ["p|a",
                  "//p|//a"]:
        print ("%s -> %s" % (xpath,
                             "; ".join([el.text_content()
                                        for el in root.xpath(xpath)])))
then -
$ python xpath_test.py
p|a -> Some text I DO want
//p|//a -> Some text I DON'T want; Some text I DO want; A link I DO want
So p|a captures too little (doesn't capture the nested link) whilst //p|//a captures too much (tags I don't want)
What xpath expression will return only Some text I DO want; A link I DO want ?
Use the following XPath, which selects every text node descended from the specified div, excluding whitespace-only nodes:
//div[@class="container"]//text()[normalize-space()]
Sample code:
data = """HTML
<html>
<body>
<p>Some text I DON'T want</p>
<div class="container">
<p>Some text I DO want</p>
<span>
A link I DO want
</span>
</div>
</body>
</html>
HTML"""
import lxml.html
tree = lxml.html.fromstring(data)
print (tree.xpath('//div[@class="container"]//text()[normalize-space()]'))
Output:
['Some text I DO want', 'A link I DO want']
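A note on why //p|//a captured too much in the question: a path starting with // is evaluated from the document root even when it is run on the container element, so the relative form .// is needed to stay inside it. Below is a rough standard-library sketch of the "all nested text under the container" idea, in case lxml is not installed. It uses xml.etree.ElementTree and itertext() instead of the //text() step, and assumes the markup is well-formed XML; DOC and texts are illustrative names.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
<body>
<p>Some text I DON'T want</p>
<div class="container">
<p>Some text I DO want</p>
<span>
A link I DO want
</span>
</div>
</body>
</html>"""

root = ET.fromstring(DOC)

# The leading "." keeps the search relative to the element it is called on.
container = root.find(".//div[@class='container']")

# itertext() walks every text chunk below the container in document order;
# dropping blank chunks mirrors the [normalize-space()] predicate.
texts = [chunk.strip() for chunk in container.itertext() if chunk.strip()]
print(texts)
```

Unlike the XPath result, the chunks are stripped here, so the newlines around the span text do not show up.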
I need an XPath that matches any HTML element containing the word sidebar anywhere in its opening tag. Example:
<p class='my class'>This is some text</p>
<h1 class='btn sidebar btn-now'><p>We have more text here</p><p> and another text here</p></h1>
<div id='something here'>New text here</div>
<div id='something sidebar here'>New text again</div>
<nav class='this sidebar btn'>This is my nav</nav>
<sidebar><div>This is some text</div></sidebar>
I need an XPath that selects any HTML element with the word 'sidebar' between the starting < and ending > of its opening tag, be it in the class, the id or the tag name. In the example above the result should be:
<h1 class='btn sidebar btn-now'><p>We have more text here</p><p> and another text here</p></h1>
<div id='something sidebar here'>New text again</div>
<nav class='this sidebar btn'>This is my nav</nav>
<sidebar><div>This is some text</div></sidebar>
It needs to be XPath, not regex.
Try the expression below and let me know if it's not what you're looking for:
//*[@*[contains(., "sidebar")] or contains(name(), "sidebar")]
@*[contains(., "sidebar")] matches a node that has any attribute whose value contains "sidebar". (The shorter contains(@*, "sidebar") only tests the first attribute in XPath 1.0.)
contains(name(), "sidebar") matches a node whose name contains "sidebar".
If only the id or class attribute should be checked:
//*[contains(@id, "sidebar") or contains(@class, "sidebar") or contains(name(), "sidebar")]
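To sanity-check the matching rule outside an XPath engine, here is a small sketch using Python's standard xml.etree.ElementTree. It is a plain-Python restatement of "tag name or any attribute value contains sidebar", not the XPath itself. The sample is a well-formed copy of the question's markup: the root wrapper element is added for parsing and the h1 is closed with </h1>. SNIPPET and mentions_sidebar are hypothetical names.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<root>
<p class='my class'>This is some text</p>
<h1 class='btn sidebar btn-now'><p>We have more text here</p><p> and another text here</p></h1>
<div id='something here'>New text here</div>
<div id='something sidebar here'>New text again</div>
<nav class='this sidebar btn'>This is my nav</nav>
<sidebar><div>This is some text</div></sidebar>
</root>"""


def mentions_sidebar(element):
    """True if the tag name or any attribute value contains 'sidebar'."""
    in_name = "sidebar" in element.tag
    in_attributes = any("sidebar" in value for value in element.attrib.values())
    return in_name or in_attributes


# iter() visits the root and every descendant in document order.
matches = [el.tag for el in ET.fromstring(SNIPPET).iter() if mentions_sidebar(el)]
print(matches)
```

The attribute check looks at every attribute of an element, which is the behaviour the answer describes.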
I have HTML like this:
<html>
<body>
<div class="somethingunneccessary"></div>
<div class="container">
<div>
<p>text1</p>
<p>text2</p>
<p>text3</p>
</div>
<div>
<p>text4</p>
<p>text5</p>
<p>text6</p>
</div>
<div>
<p>text7</p>
<p>text8</p>
<p>text9</p>
</div>
<div>
<p>text10</p>
<p>text11</p>
<p>text12</p>
</div>
<div>
<p>text13</p>
<p>text14</p>
<p>text15</p>
</div>
</div>
</body>
</html>
What I'm trying to accomplish is the following:
1./ Loop over the div elements within the div having a class container.
2./ During the iteration I want to grab the text from the 3rd p tag.
The looping part is essential instead of just slicing out the p tags by themselves
I've got some code done but it doesn't do looping:
$doc=new DOMDocument();
$doc->loadHTML($htmlsource);
$xpath = new DOMXpath($doc);
$commentxpath = $xpath->query("/html/body/div[2]/div[5]/p[3]");
$commentdata = $commentxpath->item(0)->nodeValue;
How do I loop through each inner div element and extract the 3rd p tag?
Like I said, the looping is essential.
During the iteration I want to grab the text from the 3rd p tag
Try:
"//div[@class='container']/div/p[3]"
This should return the third p of every div inside the div with class container.
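Since the looping was stressed as essential, here is a rough sketch of the loop itself, written in Python with the standard xml.etree.ElementTree rather than PHP's DOMXPath. It rebuilds a document of the same shape as the question's (five inner divs with three p each) and assumes well-formed markup; PAGE, inner and third_paragraphs are illustrative names.

```python
import xml.etree.ElementTree as ET

# Rebuild the shape of the question's document: text1..text15 in five divs.
inner = "".join(
    "<div>" + "".join(f"<p>text{3 * i + j}</p>" for j in (1, 2, 3)) + "</div>"
    for i in range(5)
)
PAGE = (
    "<html><body>"
    '<div class="somethingunneccessary"></div>'
    f'<div class="container">{inner}</div>'
    "</body></html>"
)

root = ET.fromstring(PAGE)
third_paragraphs = []

# Step 1: loop over the div children of the container div.
for div in root.findall(".//div[@class='container']/div"):
    # Step 2: within each inner div, take the third p (positions are 1-based).
    third = div.find("p[3]")
    if third is not None:
        third_paragraphs.append(third.text)

print(third_paragraphs)
```

The same two-step pattern should carry over to PHP by running a relative query with the inner div as the context node.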
You may have to query over attributes: php xpath get attribute value
$xpath->query("/html/body/div[@class='container']");
Just try
/html/body/div/div//p
That should return only the p elements XD
HTML structure looks like this:
<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>
I would like to extract text just from the parent and the tags apart from the A and B children.
I have tried
/div[@class='Parent']//text()
which extracts text from all the descendant nodes, so I made a constraint like /div[@class='Parent']//text()[not(self::div)]
but it did not change a thing.
Thanks for any advice
/div[@class='Parent']/*[not(self::div and (@id='A' or @id='B'))]//text() | /div[@class='Parent']/text()
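The union above combines two things: text inside children other than A and B, plus the parent's own text nodes. A minimal sketch of that logic in Python follows, using the standard xml.etree.ElementTree in place of XPath and assuming well-formed markup. FRAGMENT, EXCLUDED_IDS and text_outside_excluded are names invented for the example.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>"""

EXCLUDED_IDS = {"A", "B"}


def text_outside_excluded(parent):
    """Collect the parent's own text plus text of non-excluded children."""
    pieces = [parent.text or ""]
    for child in parent:
        skip = child.tag == "div" and child.get("id") in EXCLUDED_IDS
        if not skip:
            # All text nested inside this child.
            pieces.extend(child.itertext())
        # The tail follows the child but belongs to the parent, so keep it
        # even when the child itself is skipped.
        pieces.append(child.tail or "")
    return [piece.strip() for piece in pieces if piece.strip()]


parent = ET.fromstring(FRAGMENT)
kept = text_outside_excluded(parent)
print(kept)
```

Keeping the tail of a skipped child corresponds to the /text() branch of the union, which is why "more tags" survives although it comes right after div B.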
I have a page with content that looks similar to this:
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
My goal is to capture the text in #level2 but the #level3 <div> is nested inside of it at the same level as the text I want.
Is it possible to somehow exclude that <div>? Should I be modifying the document and simply removing the element before parsing?
require 'nokogiri'
xml = <<-XML
<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>
XML
page = Nokogiri::XML(xml)
p page.xpath("//*[@id='level3']").remove.xpath("//*[@id='level2']").inner_text
# => "\n \n Here is some text i want\n \n Here is some more text i want\n \n Oh i want this text too :)\n "
Now, you may clean the output text if you wish.
If your HTML fragment is in the variable html, then you could do something like this:
doc = Nokogiri::HTML(html)
div = doc.at_css('#level2') # Extract <div id="level2">
div.at_css('#level3').remove # Remove <div id="level3">
text_you_want = div.inner_text
You could also do it with XPath but I find CSS selectors a bit simpler for simple cases like this.
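For what it's worth, the document does not have to be modified at all: the XPath //*[@id='level2']/text() selects only the text nodes that are direct children of level2, which leaves out anything inside level3. Here is a rough Python sketch of that idea using the standard xml.etree.ElementTree instead of Nokogiri. It assumes the fragment is well-formed XML (the <br /> tags are), and PAGE and direct_text are made-up names.

```python
import xml.etree.ElementTree as ET

PAGE = """<div id="level1">
<div id="level2">
<div id="level3">Crap i dont care about</div>
Here is some text i want
<br />
Here is some more text i want
<br />
Oh i want this text too :)
</div>
</div>"""


def direct_text(element):
    """Return the non-blank text nodes that are direct children of element."""
    # Direct text is the element's leading .text plus the .tail of each child;
    # text nested inside the children themselves is never touched.
    chunks = [element.text or ""] + [child.tail or "" for child in element]
    return [chunk.strip() for chunk in chunks if chunk.strip()]


level1 = ET.fromstring(PAGE)
level2 = level1.find(".//div[@id='level2']")
wanted = direct_text(level2)
print(wanted)
```

Nothing is removed from the tree, so the parsed document can still be used for other queries afterwards.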