XPath expression to capture all nested text under a specific root

I have some HTML from which I want to extract text content using Python + lxml
<html>
<body>
<p>Some text I DON'T want</p>
<div class="container">
<p>Some text I DO want</p>
<span>
A link I DO want
</span>
</div>
</body>
</html>
A couple of conditions:
I only want text nested under a specific root, div[@class='container']
I want all nested text under that root
So -
if __name__=="__main__":
    import lxml.html
    doc=lxml.html.fromstring(HTML)
    root=doc.xpath("//div[@class='container']").pop()
    for xpath in ["p|a",
                  "//p|//a"]:
        print ("%s -> %s" % (xpath,
                             "; ".join([el.text_content()
                                        for el in root.xpath(xpath)])))
then -
$ python xpath_test.py
p|a -> Some text I DO want
//p|//a -> Some text I DON'T want; Some text I DO want; A link I DO want
So p|a captures too little (doesn't capture the nested link) whilst //p|//a captures too much (tags I don't want)
What xpath expression will return only Some text I DO want; A link I DO want ?

Use the following XPath, which selects all text-node descendants of the specified div, excluding whitespace-only nodes:
//div[@class="container"]//text()[normalize-space()]
Sample code:
data = """HTML
<html>
<body>
<p>Some text I DON'T want</p>
<div class="container">
<p>Some text I DO want</p>
<span>
A link I DO want
</span>
</div>
</body>
</html>
HTML"""
import lxml.html
tree = lxml.html.fromstring(data)
print (tree.xpath('//div[@class="container"]//text()[normalize-space()]'))
Output:
['Some text I DO want', 'A link I DO want']
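The reason //p|//a in the question captured too much is that a path starting with // is evaluated from the document root, even when it is called on an element. A minimal sketch of the relative form, assuming the link in the sample is an a element inside the span (the href value is made up for illustration):

```python
import lxml.html

# Assumption: the question's sample with an <a> inside the <span>;
# the href value is invented for this sketch.
HTML = """
<html>
<body>
<p>Some text I DON'T want</p>
<div class="container">
<p>Some text I DO want</p>
<span>
<a href="#">A link I DO want</a>
</span>
</div>
</body>
</html>
"""

doc = lxml.html.fromstring(HTML)
root = doc.xpath("//div[@class='container']")[0]

# ".//p" starts at the context node (the container div),
# whereas "//p" would start at the document root.
elements = root.xpath(".//p | .//a")
texts = [el.text_content().strip() for el in elements]

# The same idea applied to text nodes, skipping whitespace-only ones.
text_nodes = [t.strip() for t in root.xpath(".//text()[normalize-space()]")]

print("; ".join(texts))
print(text_nodes)
```

Both queries stay inside the container div, so the paragraph outside it is not returned.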

Related

Auto generate XPath for known element in HTML tree using python

Is there any way (using a library, not manually) to generate a relative XPath for a known element in HTML?
Let's say the second p element inside class="content":
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>****</p>
<p>****</p>
</div>
</body>
</html>
Use case:
The idea is to guess where the elements I might be interested in are located, for example title, content, or author. After I've found an element, I want to generate an XPath for it and use it later from Python 3.
Try something like this:
from lxml import etree
datum = """
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>
"""
root = etree.fromstring(datum)
tree = etree.ElementTree(root)
find_text = etree.XPath("//p[text()='target']")
for target in find_text(root):
    print(tree.getpath(target))
Output:
/html/body/div[2]/p[2]
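A generated path is only useful if it leads back to the same element. The sketch below wraps getpath in a small helper and evaluates the result again as a check; xpath_for is a hypothetical name, not part of lxml:

```python
from lxml import etree


def xpath_for(element):
    """Hypothetical helper: absolute XPath of an element within its own tree."""
    return element.getroottree().getpath(element)


datum = """
<html>
<body>
<div class="title">
<h1>***</h1>
</div>
<p> *** </p>
<h3>***</h3>
<div class="content">
<p>something</p>
<p>target</p>
</div>
</body>
</html>
"""

root = etree.fromstring(datum)

# Locate the element by position instead of by its text.
target = root.xpath("//div[@class='content']/p[2]")[0]
path = xpath_for(target)
print(path)

# Evaluating the generated path should find the same element again.
print(root.xpath(path)[0].text)
```

Paths produced this way are purely positional, so they are likely to break if the page layout changes.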

Xpath for any html element containing specific text inside html tag

I need an XPath that matches any HTML element containing the word sidebar anywhere in its tag. Example:
<p class='my class'>This is some text</p>
<h1 class='btn sidebar btn-now'><p>We have more text here</p><p> and anoter text here</p></div>
<div id='something here'>New text here</div>
<div id='something sidebar here'>New text again</div>
<nav class='this sidebar btn'>This is my nav</nav>
<sidebar><div>This is some text</div></sidebar>
I need an XPath that returns any HTML element that has the word 'sidebar' between the opening < and closing > of its start tag, whether in the class, the id, or the tag name. For the example above, the result should be:
<h1 class='btn sidebar btn-now'><p>We have more text here</p><p> and anoter text here</p></div>
<div id='something sidebar here'>New text again</div>
<nav class='this sidebar btn'>This is my nav</nav>
<sidebar><div>This is some text</div></sidebar>
It needs to be XPath, not regex.
Try below and let me know if it's not what you're searching for:
//*[contains(@*, "sidebar") or contains(name(), "sidebar")]
contains(@*, "sidebar") means node with any attribute that contains "sidebar"
contains(name(), "sidebar") - node name that contains "sidebar"
If you need only id or class to contain "sidebar":
//*[contains(@id, "sidebar") or contains(@class, "sidebar") or contains(name(), "sidebar")]
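One caveat: in XPath 1.0, passing a node-set such as @* to contains() converts it to the string value of its first node only, so an element whose match sits in a later attribute can be missed. A sketch with lxml of a variant that tests every attribute; the wrapping root element, the corrected closing h1 tag, and the last div are assumptions added to make the sample well-formed and to show the difference:

```python
from lxml import etree

# Assumption: the question's sample wrapped in a <root> element, with the
# mismatched </div> fixed and one extra element added for illustration.
snippet = """
<root>
<p class='my class'>This is some text</p>
<h1 class='btn sidebar btn-now'><p>We have more text here</p></h1>
<div id='something here'>New text here</div>
<div id='something sidebar here'>New text again</div>
<nav class='this sidebar btn'>This is my nav</nav>
<sidebar><div>This is some text</div></sidebar>
<div id='main' class='left sidebar'>Match is in the second attribute</div>
</root>
"""
doc = etree.fromstring(snippet)

# Original form: contains() sees only the first attribute of each element.
first_attr_only = doc.xpath(
    '//*[contains(@*, "sidebar") or contains(name(), "sidebar")]'
)

# Variant: the inner predicate is applied to each attribute in turn.
any_attr = doc.xpath(
    '//*[@*[contains(., "sidebar")] or contains(name(), "sidebar")]'
)

print([el.tag for el in first_attr_only])
print([el.tag for el in any_attr])
```

The second expression should also work unchanged in other XPath 1.0 engines, since it relies only on a nested predicate.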

How to insert tags in html code with bash

I have the following html code:
<!DOCTYPE html>
<html>
...
<main>
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
</div>
</div>
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
</div>
</div>
...
</main>
<footer>
</footer>
</body>
</html>
I want to insert the tags into the div with id "tags", but only once, for a single post.
I tried this:
for tag in $tags
do
    sed -i "/<div id=\"tags\">/a $tag" $publicDir/index.html
done
But that adds all the tags to every postLink.
I want each postLink to have only its own tags, not those of the other postLinks, like this:
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
Hellotag
Othertag
</div>
</div>
<div class=postLink>
Write your title here
<div id="description"> Write the description of the post. It will show in the index.</div>
<div id="datePost">19-06-2017</div>
<div id="tags">
Byetag
Xtag
</div>
</div>
Sorry for my bad English.
I solved the problem from another point of view.
I create a variable called string holding the HTML for the "postLink" block up to the point where the tags belong. A for loop then appends the tags, and finally the closing div tags are added.
That way only the new "postLink" gets the tags, and they do not affect any postLink that was already on the page.
string="<div class=postLink>\n\t $title\n <div id=\"description\">$description</div>\n<div id=\"datePost\">$date</div>\n<div id=\"tags\">\n"
for tag in $tags
do
    string+="$tag\n"
done
string+="</div>\n</div>"
sed -i "/<main>/a $string" "$publicDir/index.html"
Thanks to John Goofy and 123 for trying to help me. :)
Using awk and passing the text to be inserted into the tags div as a variable should achieve what you need. You could build this variable in a for loop, for example from the files in the tags directory; just make sure you add newlines and tabs where required.
awk -v var="\n\t Xtag\n" '{ if ( $0 ~ "tags" && done != 1 ) { print $0var;done=1 } else { print $0 } }' html
Here html is the file in question. The text to add is passed as a variable called var, and when a line containing tags is found (you can tighten the regular expression here if necessary) the line is printed followed by the additional text. The variable done is then set to 1 so that the condition fails on the next tags div and the text isn't printed again.
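For comparison, the same first-match-only logic can be sketched in Python. This is an illustration rather than part of either answer; insert_after_first_match and the shortened page are made-up names and data:

```python
def insert_after_first_match(lines, marker, new_lines):
    """Return a copy of lines with new_lines added after the first line containing marker."""
    result = []
    done = False  # plays the same role as the awk variable "done"
    for line in lines:
        result.append(line)
        if not done and marker in line:
            result.extend(new_lines)
            done = True
    return result


# Hypothetical, shortened version of the page from the question.
page = [
    '<div class=postLink>',
    '<div id="tags">',
    '</div>',
    '</div>',
    '<div class=postLink>',
    '<div id="tags">',
    '</div>',
    '</div>',
]

updated = insert_after_first_match(page, '<div id="tags">', ["Hellotag", "Othertag"])
print("\n".join(updated))
```

Only the first tags div receives the new lines; the second postLink is left untouched.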

XPath extract text inside tag

HTML structure looks like this:
<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>
I would like to extract text just from the parent and its tags, apart from the A and B children.
I have tried
/div[@class='Parent']//text()
which extracts text from all the descendant nodes, so I added a constraint like /div[@class='Parent']//text()[not(self::div)]
but it did not change anything.
Thanks for any advice
/div[@class='Parent']/*[not(self::div and (@id='A' or @id='B'))]//text() | /div[@class='Parent']/text()
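The attempt in the question had no effect because text()[not(self::div)] tests the text nodes themselves, and a text node is never a div. A sketch of the expression above evaluated with lxml, assuming a simplified version of the sample in which "parent text" stands in for the parent's own text:

```python
from lxml import etree

# Assumption: simplified sample; the contents of A and B are placeholders.
fragment = """
<div class="Parent">
<div id="A">skip A</div>
<div id="B">skip B</div>
parent text
<p> and text </p>
</div>
"""
root = etree.fromstring(fragment)

expr = (
    "/div[@class='Parent']/*[not(self::div and (@id='A' or @id='B'))]//text()"
    " | /div[@class='Parent']/text()"
)

# Drop whitespace-only nodes and trim the rest.
kept = [t.strip() for t in root.xpath(expr) if t.strip()]
print(kept)
```

The first branch of the union collects text inside every child except A and B, and the second branch collects the parent's own direct text nodes.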

grabbing text between two elements in nokogiri?

<body>
<div>some text</div>
I NEED THIS TEXT ONLY
<div>some text</div>
more text here
<div>some text</div>
one more text here
<div>some text</div>
</body>
How?
Use:
/*/div[1]/following-sibling::text()[1]
This selects the first text-node sibling of the first div child of the top element of the document.
This returns the first text node within body between two div elements (element() is an XPath 2.0 node test):
/body/text()[
./preceding::element()[1][local-name()="div"] and
./following::element()[1][local-name()="div"]
][1]
should return
I NEED THIS TEXT ONLY
This XPath 1.0:
/body/text()[preceding-sibling::*[1][self::div]]
[following-sibling::*[1][self::div]][1]
Also:
/body/text()[normalize-space()][1]
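The XPath 1.0 expressions above can be checked side by side. The sketch below uses Python with lxml rather than nokogiri, on the assumption that both evaluate XPath 1.0 the same way here; the XPath 2.0 variant with element() is left out because lxml only implements XPath 1.0:

```python
from lxml import etree

body = etree.fromstring("""<body>
<div>some text</div>
I NEED THIS TEXT ONLY
<div>some text</div>
more text here
<div>some text</div>
one more text here
<div>some text</div>
</body>""")

expressions = [
    "/*/div[1]/following-sibling::text()[1]",
    "/body/text()[preceding-sibling::*[1][self::div]]"
    "[following-sibling::*[1][self::div]][1]",
    "/body/text()[normalize-space()][1]",
]

results = []
for expr in expressions:
    matches = body.xpath(expr)
    # Each text node still carries the surrounding newlines, so trim it.
    results.append(matches[0].strip())
    print(expr, "->", results[-1])
```

All three select the same text node for this input; they differ in how they would behave if the first text after a div were whitespace-only or not followed by another div.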
I don't have nokogiri, but here's an alternative using just basic string manipulation.
html=<<EOF
<body>
<div>some text</div>
I NEED THIS TEXT ONLY
<div>some text</div>
more text here
<div>some text</div>
one more text here
<div>some text</div>
</body>
EOF
p html.split(/<\/*body>/)[1].split(/<\/div>/)[1].split(/<div>/)[0]

Resources