I've seen similar questions, but the solutions I've seen won't work on the following. I'm far from an XPath expert; I just need to parse some HTML. How can I select the table that follows Header 2? I thought my solution below would work, but apparently it does not. Can anyone help me out here?
content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>
"""
from lxml import etree
tree = etree.HTML(content)
tree.xpath("//table/following::p/b[text()='Header 2']")
Some alternatives to @Arup's answer:
tree.xpath("//p[b='Header 2']/following-sibling::table[1]")
This selects the first table sibling following the p that contains a b whose text is "Header 2".
tree.xpath("//b[.='Header 2']/following::table[1]")
This selects the first table in document order after the b whose text is "Header 2".
See XPath 1.0 specifications for details on the different axes:
the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes
the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node or namespace node, the following-sibling axis is empty
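To see where the two axes diverge, here is a minimal sketch. It uses a variant of the sample markup that is an assumption made for illustration (not the asker's page): the table is wrapped in an extra div, so it still follows the header in document order but is no longer a sibling of the p.

```python
from lxml import etree

# Assumed variant of the sample markup: the table sits inside an extra <div>,
# so it comes after the <p> in document order but is not its sibling.
nested = """<div>
<p><b>Header 2</b></p>
<div>
<table><tr><td>Something</td></tr></table>
</div>
</div>"""

tree = etree.HTML(nested)

# following-sibling only considers nodes that share the <p>'s parent.
sibling_hits = tree.xpath("//p[b='Header 2']/following-sibling::table[1]")

# following considers every node after the <b> in document order.
following_hits = tree.xpath("//b[.='Header 2']/following::table[1]")

print(len(sibling_hits), len(following_hits))
```

With this markup the following-sibling expression finds nothing, while the following expression still reaches the nested table. On the markup from the question, where the table is a direct sibling of the p, both return the same table.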
Use the following XPath 1.0 expression, which relies on the preceding axis:
//table[preceding::p[1]/b[.='Header 2']]
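As a quick check, this sketch runs the preceding-axis expression against the markup from the question. The variable names `tables`, `cells` and `wrong_header` are only illustrative.

```python
from lxml import etree

content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>
"""
tree = etree.HTML(content)

# preceding is a reverse axis, so p[1] is the nearest <p> before the table.
tables = tree.xpath("//table[preceding::p[1]/b[.='Header 2']]")
cells = [td.text for table in tables for td in table.iter("td")]

# The nearest <p> holds "Header 2", so asking for "Header 1" matches nothing.
wrong_header = tree.xpath("//table[preceding::p[1]/b[.='Header 1']]")

print(cells, wrong_header)
```

Because the predicate only inspects the nearest preceding p, the table is matched for "Header 2" but not for "Header 1".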
I have a structure that looks something like this:
<p>
<br>
<b>Text to fetch </b>
<br>
"Some random text"
<b>Text not to fetch</b>
I need an XPath that fetches the following sibling of a br element only if there is no text between the br element and that sibling.
If I do something like this
//br/following-sibling::b/text()[1]
It will fetch both Text to fetch and Text not to fetch, while I only need Text to fetch.
Another possible XPath :
//br/following-sibling::node()[normalize-space()][1][self::b]/text()
Brief explanation:
//br/following-sibling::node(): find all nodes that are following siblings of a br element, where the nodes are..
[normalize-space()]: not empty (and not whitespace only), then..
[1]: for each br found, take only the first such node, then..
[self::b]: check whether that node is a b element, and if it is..
/text(): return the text node that is a child of the b element
Try below XPath to avoid matching b nodes with preceding sibling text:
//br/following-sibling::b[not(preceding-sibling::text()[1][normalize-space()])]/text()
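Both expressions above can be tried with lxml. This is a small sketch, assuming the fragment from the question with a closing `</p>` added and the quotes around the stray text dropped. The names `first_node`, `no_text_before` and `results` are illustrative.

```python
from lxml import etree

# Fragment based on the question; the closing </p> is an addition.
snippet = """<p>
<br>
<b>Text to fetch </b>
<br>
Some random text
<b>Text not to fetch</b>
</p>"""
tree = etree.HTML(snippet)

# First non-blank node after each <br>, kept only if it is a <b>.
first_node = "//br/following-sibling::node()[normalize-space()][1][self::b]/text()"

# <b> siblings whose nearest preceding text node is blank or missing.
no_text_before = ("//br/following-sibling::b"
                  "[not(preceding-sibling::text()[1][normalize-space()])]/text()")

results = {}
for expr in (first_node, no_text_before):
    results[expr] = [str(t) for t in tree.xpath(expr)]
    print(results[expr])
```

On this fragment both expressions keep only the first b, because the second one is separated from its br by a non-blank text node.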
I'm trying to scrape a web page. I would like to fetch the three alternate texts (the alt attribute, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath :
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation :
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Leaving out the @ symbol changes the meaning of the selector: it then refers to a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous step, find the child <a> elements whose class attribute equals 'cover-wrapper'
/img/@alt : then, from each such <a> element, find the child <img> element and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
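The steps above can be sketched with lxml as follows. The markup is an assumption: the screenshot from the question is not available, so the structure below (three `showcase-wrapper` divs with ids `slide-1` to `slide-3`, and the alt texts) is reconstructed from the XPath expressions in the answers and is purely illustrative.

```python
from lxml import html

# Assumed markup, reconstructed from the XPath expressions in the answers.
page = """<div>
<div class="showcase-wrapper" id="slide-1">
  <a class="cover-wrapper" href="#"><img src="1.png" alt="First cover"></a>
</div>
<div class="showcase-wrapper" id="slide-2">
  <a class="cover-wrapper" href="#"><img src="2.png" alt="Second cover"></a>
</div>
<div class="showcase-wrapper" id="slide-3">
  <a class="cover-wrapper" href="#"><img src="3.png" alt="Third cover"></a>
</div>
</div>"""
tree = html.fromstring(page)

# Attribute tests need the '@' prefix.
one = tree.xpath("//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt")
all_three = tree.xpath(
    "//div[starts-with(@id,'slide-')]/a[@class='cover-wrapper']/img/@alt")

# Without '@', class and id are read as child element names, so nothing matches.
broken = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')

print(one, all_three, broken)
```

The last query is the one from the question; it returns an empty list because an `<a>` element has no child elements named `class` or `id`.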
Try this:
//a[@class="cover-wrapper"]/img/@alt
This first selects the a element whose class is cover-wrapper, then its img child, and then the alt attribute of that img.
To find the whole image element:
//a[@class="cover-wrapper"]/img
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt
I'm using HtmlAgilityPack to get a filtered DOM of <h2> and <h3> nodes and using Xpath 1.0 (from my Xpath 1.0 crash course this week) I need to get the number of <h3>'s (the number varies) that are between sibling <h2>'s as follows:
<div>
<h2>heading 1</h2>
<h3>sub 1.1</h3>
<h3>sub 1.2</h3>
<h2>heading 2</h2>
<h3>sub 2.1</h3>
<h2>heading 3</h2>
....
</div>
When I iterate (using C#) through the filtered nodes I want the exact number of <h3>'s that are after a <h2> and before the next <h2>. When I use the following I get all the <h3>'s as the result.
int countH3 = n.SelectNodes("./preceding-sibling::h2[2]/following-sibling::h2[3]/preceding-sibling::h3").Count(); //the [position] is set dynamically
For the node structure above I would like the result of the code line to be:
countH3 = 1
but it is:
countH3 = 3
I've found many similar SO questions regarding "sibling nodes between sibling nodes" and have to thank @LarsH for his comment in another question that /preceding::h3 returns ALL <h3>'s which helped explain the issue. I think I may need to use the Kayessian method of node-set intersection but get the "invalid token" error when I include the . | union character as follows:
countH3 = n.SelectNodes("./h2[2]/following-sibling::h2[3]
[count(.|./h2[2]/following-sibling::h2[3]/preceding-sibling::h3)=
count(./h2[2]/following-sibling::h2[3]/preceding-sibling::h3)]").Count();
Any suggestions appreciated.
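One possible approach, sketched here in Python with lxml rather than C# with HtmlAgilityPack, is a plain counting predicate instead of the Kayessian intersection: an h3 belongs to a given h2 when the number of h2 siblings before the h3 equals the position of that h2 among its h2 siblings. The names `counts` and `k` are illustrative, and whether the same predicate string behaves identically in HtmlAgilityPack is an assumption that has not been tested here.

```python
from lxml import etree

doc = """<div>
<h2>heading 1</h2>
<h3>sub 1.1</h3>
<h3>sub 1.2</h3>
<h2>heading 2</h2>
<h3>sub 2.1</h3>
<h2>heading 3</h2>
</div>"""
tree = etree.HTML(doc)

counts = {}
for h2 in tree.xpath("//h2"):
    # 1-based position of this <h2> among its <h2> siblings.
    k = int(h2.xpath("count(preceding-sibling::h2)")) + 1
    # Keep only the <h3> siblings that have exactly k <h2> siblings before them.
    subs = h2.xpath(f"following-sibling::h3[count(preceding-sibling::h2) = {k}]")
    counts[h2.text] = len(subs)

print(counts)
```

The position is formatted into the expression string, which mirrors the dynamically set `[position]` in the question and avoids relying on any library-specific variable binding.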
I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = @doc.xpath("//*[@*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal; they are there to indicate where you put in the attribute name. The outer square brackets, on the other hand, are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
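The XPath expression itself is independent of Nokogiri. As a rough cross-check, here is the same pair of queries in Python with lxml (a swapped-in library, not the asker's setup), run against the markup from the question:

```python
from lxml import etree

doc = """<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>"""
tree = etree.HTML(doc)

# Elements that carry an attribute named 'blah'.
with_attr = [el.tag for el in tree.xpath("//*[@blah]")]

# Attributes with a child element named 'blah': attributes have no children.
never_matches = tree.xpath("//*[@*[blah]]")

print(with_attr, never_matches)
```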
You can use CSS selectors:
elements = @doc.css "[blah]"
I have HTML code like the following:
<p style="margin:0 0 0.5em 0;"><b>Blablub</b></p>
<table> ... </table>
Now I want to query the content of the <b> right above the table but only if the table does not have any attributes. I tried the following query:
//table[not(@*)]/preceding-sibling::p/b
If I remove the preceding-sibling::p/b part entirely it works: it gives me exactly the tables I need. However, with the full query I also get the content of a <b> tag that precedes a table WITH attributes.
Use:
//table[not(@*)]/preceding-sibling::*[1][self::p]/b
This means: Select all b elements that are children of all p elements that are the first preceding sibling of a table that has no attributes.
This is quite different from an expression that only indexes the p elements:
//table[not(@*)]/preceding-sibling::p[1]/b
The latter selects the b children of the first preceding p sibling; there is no guarantee that the first preceding p sibling is also the first preceding element sibling.
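The difference between the three variants is easier to see on a concrete page. This is a minimal sketch with lxml. The markup is invented for illustration: one table with an attribute, one plain table directly after a p, and one plain table separated from its p by an hr.

```python
from lxml import etree

# Invented markup: table x has an attribute, y directly follows a <p>,
# and z is separated from its <p> by an <hr>.
page = """<div>
<p><b>A</b></p>
<table border="1"><tr><td>x</td></tr></table>
<p><b>B</b></p>
<table><tr><td>y</td></tr></table>
</div>
<div>
<p><b>C</b></p>
<hr>
<table><tr><td>z</td></tr></table>
</div>"""
tree = etree.HTML(page)


def b_texts(expr):
    """Return the text of every <b> selected by expr."""
    return [b.text for b in tree.xpath(expr)]


# Every preceding <p> sibling of an attribute-less table.
any_p = b_texts("//table[not(@*)]/preceding-sibling::p/b")
# The nearest preceding <p> sibling, even if other elements lie in between.
nearest_p = b_texts("//table[not(@*)]/preceding-sibling::p[1]/b")
# The immediately preceding element sibling, and only if it is a <p>.
adjacent_p = b_texts("//table[not(@*)]/preceding-sibling::*[1][self::p]/b")

print(any_p, nearest_p, adjacent_p)
```

Only the last form restricts the match to a p that sits directly above an attribute-less table.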