How to select the innermost node which contains an email? - xpath

From this sample HTML
<html>
<title>Our site</title>
<body bgcolor="#333366" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0">
<div id="Layer2" style="position:absolute; width:106px; height:134px; z-index:2; left: 20px; top: 340px;" class="info">info#systems.ca</div>
</body>
</html>
I want to use XPath to get the innermost node that contains the email.
I tried this:
/*[contains(.,'#')]
But it selects the 'HTML' node. The node can have any name (I know the '#' is a very weak selection, but I will then use a regex to make sure the node contains an email).
EDIT
In this case I want 'DIV'

You can do this by selecting text nodes instead of *, then getting their parent nodes. The XPath expression would be:
//text()[contains(.,'#')]/..
This returns every element that has at least one text-node child containing the '#' character.

Probably not the most efficient, but try:
//*[contains(.,'#') and not(descendant::*[contains(.,'#')])]
or, to get just the last such element in document order:
(//*[contains(.,'#')])[last()]
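
As a rough illustration of the text-node approach, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. Python and the helper name elements_with_own_text are assumptions for illustration, not part of the question. ElementTree implements only a subset of XPath (no text() or contains()), so the test on text nodes is done in plain Python, and the sample's attributes are trimmed for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page (well-formed, so an XML parser accepts it).
SAMPLE = """<html>
<title>Our site</title>
<body bgcolor="#333366">
<div id="Layer2" class="info">info#systems.ca</div>
</body>
</html>"""


def elements_with_own_text(root, needle):
    """Return elements that have a text-node child containing needle."""
    matches = []
    for element in root.iter():
        # element.text is the text before the first child; each child's
        # .tail is text that follows that child but still belongs to element.
        own_text = [element.text] + [child.tail for child in element]
        if any(needle in (chunk or "") for chunk in own_text):
            matches.append(element)
    return matches


root = ET.fromstring(SAMPLE)
innermost = elements_with_own_text(root, "#")
print([element.tag for element in innermost])  # ['div']
```

Only text nodes are inspected, so the '#' inside the bgcolor attribute does not cause body to match.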

Related

XPath node that doesn't contain a child

I'm trying to access a certain element using XPath, but I just can't seem to get it, and I don't quite understand why.
<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>
Basically, the text I want is the second one, but I want/need to use something similar to p[not(div)] to retrieve it.
I have tried the methods from the following link but to no avail (xpath find node that does not contain child)
Here is how I tried accessing the text:
ul[contains(@id,"content")]//p[not(.//div)]/text()
If you have any possible answers, thank you !
The HTML snippet posted in the question shows that neither p element contains a div, so the expression //p[not(.//div)] would match both p elements. The first p element is a sibling of the div (both share the same parent element li), not its parent or ancestor. The following XPath expression would match text nodes from the second p and not those from the first one:
//ul[contains(@id,"content")]/li[not(div)]/p/text()
Brief explanation:
//ul[contains(@id,"content")]: find ul elements whose id attribute value contains the text "content"
/li[not(div)]: from such ul, find child li elements that don't have a child div element. This matches only the last li in the example HTML
/p/text(): from such li, find child p elements and then return the child text nodes from such p
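
The same selection can be sketched in Python with the standard library's xml.etree.ElementTree. This is an illustration under assumptions: Python is not used in the question, and because ElementTree's XPath subset has no not() or contains(), the "li without a div child" test is written as a Python condition instead.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>"""

# The ul is the root of this fragment, so no search for it is needed here.
ul = ET.fromstring(SNIPPET)

texts = [
    p.text
    for li in ul.findall("li")    # child li elements of the ul
    if li.find("div") is None     # equivalent of li[not(div)]
    for p in li.findall("p")      # child p elements of the remaining li
]
print(texts)  # ['I need this text here']
```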

XPath selector for tag with specific descendant tags selects other tags

Given a document:
<html>
<body>
<div>
<div>No span</div>
<span>Target</span>
</div>
</body>
</html>
I would like to select the <div> containing the <span>. However, when I use this selector:
//div[//span]
It matches both <div>s:
<div><div>No span</div><span>Target</span></div> <-- what I wanted
<div>No span</div> <-- this is also matched
I tested this on Google Chrome's Devtools, as well as several online XPath evaluators, so I assume this is the correct behavior.
Why is this happening, and how can I fix my selector?
select the <div> containing the <span>
Use relative paths.
//div[.//span]
// starts from the document root. .// starts from the context element.
Predicates evaluate to true when the contained expression selects nodes. This means that //div[//span] is always true when there is a <span> anywhere in the document, in which case all <div>s in the document will be selected. //div[.//span] is only true when there is a <span> anywhere in the respective <div>.
If you mean "has a <span> child" (as opposed to "has a <span> descendant") this will work:
//div[span]
which is a shorthand for this (to underline the difference between / and //):
//div[./span]
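
To make the difference concrete, here is a small Python sketch using the standard library's xml.etree.ElementTree (an assumption for illustration; the question itself used browser tools). ElementTree supports only a subset of XPath, so the first two predicates are emulated in Python, while //div[span] can be passed to findall directly in its relative form.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
<body>
<div>
<div>No span</div>
<span>Target</span>
</div>
</body>
</html>"""

root = ET.fromstring(DOC)

# Like //div[//span]: the test looks at the whole document, so it gives the
# same answer for every div.
document_wide = [d for d in root.iter("div") if root.find(".//span") is not None]

# Like //div[.//span]: the test starts at the div currently being examined.
relative = [d for d in root.iter("div") if d.find(".//span") is not None]

# //div[span] (child test) is inside ElementTree's XPath subset.
with_child = root.findall(".//div[span]")

print(len(document_wide), len(relative), len(with_child))  # 2 1 1
```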

Can I write a short path in XPath?

<html>
<body>
Example
SO
<div>
<div class="kekeke">JSAFK</div>
</div>
</body>
</html>
To get the JSAFK element in this doc using XPath, can I just write //*div[@class=kekeke] instead of the full XPath?
// is short for /descendant-or-self::node()/. So...
This XPath,
//div[@class='kekeke']
will select all such div elements in the document:
<div class="kekeke">JSAFK</div>
This XPath,
//div[@class='kekeke']/text()
will select all text nodes under all such div elements in the document:
JSAFK
There is something wrong in
//*div[@class=kekeke]
You can't use * and div together, and the attribute value needs quotes. If you want a shorter path,
you can write it like this:
//div[@class="kekeke"]/text()
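
For a quick check of the short form, the following sketch runs it with Python's standard xml.etree.ElementTree (Python is an assumption here, not something the question mentions). ElementTree does not support text() and does not accept absolute paths on an element, so the path is written with a leading ".//" and the text is read from the element's text attribute.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
<body>
Example
SO
<div>
<div class="kekeke">JSAFK</div>
</div>
</body>
</html>"""

root = ET.fromstring(DOC)

# Relative form of //div[@class='kekeke'], searched from the root element.
matches = root.findall(".//div[@class='kekeke']")
print([div.text for div in matches])  # ['JSAFK']
```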

How can I select nodes that don't contain links but which do contain specific text using xpath

Given the following HTML:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the interim there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
I want all the nodes containing the word "interim" but not if the word "interim" is part of a link element.
The nodes I would expect back are the first P node and the LI node only.
I've tried the following:
'//*/text()[not(a) and contains(.,"interim")]'
... but this still returns the A and also returns part of its parent P node (the part after the A), neither of which is desired. You can see my attempt here: https://glot.io/snippets/ehp7hmmglm
If you use the XPath expression //*[not(self::a) and not(a) and text()[contains(.,"interim")]] then you get all elements that do not contain an a element, are not a elements and contain a text node child containing that word.
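
The logic of that expression can be sketched in Python with the standard library's xml.etree.ElementTree. Several things here are assumptions: the question uses a different language, the helper name own_text_contains is made up, and the sample below wraps one word of the second paragraph in an a element because the HTML as shown above contains no link. ElementTree's XPath subset has no self::, not() or text(), so each condition is written as a Python test.

```python
import xml.etree.ElementTree as ET

# Assumption: the second paragraph wraps one word in a link.
CONTENT = """<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a href="#">interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>"""


def own_text_contains(element, word):
    """True if one of the element's own text nodes contains word."""
    chunks = [element.text] + [child.tail for child in element]
    return any(word in (chunk or "") for chunk in chunks)


root = ET.fromstring(CONTENT)
selected = [
    el
    for el in root.iter()
    if el.tag != "a"                        # not(self::a)
    and el.find("a") is None                # not(a)
    and own_text_contains(el, "interim")    # text()[contains(., "interim")]
]
print([el.tag for el in selected])  # ['p', 'li']
```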

Extracting contents from a list split across different divs

Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each Section headers. However, inconveniently the designer of the webpage decided to break up the data into three columns, adding an additional div (with classes column right etc).
My current method of extraction uses XPath.
For the section headers, I use this expression (get all h1 elements within a div with the given id):
//div[@id="relevantID"]//h1
The above returns a list of h1 elements. Looping over each element, I apply an additional selector: for each matched h1 element, look up the following ul nodes and retrieve all their li nodes.
following-sibling::ul//li
But thanks to the designer's aesthetics, I am failing in the one particular case I've marked in the HTML file, where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the html to make a selector match is considered good (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not necessary; hints/tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axis if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
You can combine following-sibling and preceding-sibling to get possible li elements in a div before the h1 and use the union operator |. As an example, for the second h1:
((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 using //div[@id="relevantID"]//h1 and retrieving all li items for each h1 using following-sibling::ul//li as a second step, you could combine this to following-sibling::ul//li | preceding-sibling::ul//li.
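
Since the list items stay in proper document order, another way to sketch the idea is a single document-order pass that assigns every li to the most recent h1 seen. The example below is an assumption-laden illustration, not one of the answers above: it uses Python's standard xml.etree.ElementTree (which lacks the following:: and preceding:: axes), the function name items_by_header is hypothetical, and the sample is a compacted copy of the HTML from the question.

```python
import xml.etree.ElementTree as ET

PAGE = """<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul><li>item1a</li><li>item1b</li><li>item1c</li><li>item1d</li></ul>
</div>
<div class="column">
<ul><li>item1e</li><li>item1f</li></ul>
<h1> Section-Header-2 </h1>
<ul><li>item2a</li><li>item2b</li><li>item2c</li><li>item2d</li></ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul><li>item3a</li><li>item3b</li><li>item3c</li><li>item3d</li></ul>
</div>
</div>"""


def items_by_header(container):
    """Group li texts under the nearest preceding h1, ignoring column divs."""
    groups = {}
    current = None
    # iter() walks the tree in document order, so column boundaries vanish.
    for element in container.iter():
        if element.tag == "h1":
            current = element.text.strip()
            groups[current] = []
        elif element.tag == "li" and current is not None:
            groups[current].append(element.text)
    return groups


sections = items_by_header(ET.fromstring(PAGE))
for header, items in sections.items():
    print(header, items)
```

With this sample, item1e and item1f end up under Section-Header-1 even though they live in the second column div.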
