Scrapy and XPath: get the text in a child element, if the parent element contains text

How do I get the text of a child element if the parent element contains text with a specific string?
For example:
<li>
"string1"
<span>
"Hello"
</span>
</li>
<li>
"string2"
<span>
"Ola"
</span>
</li>
From the above HTML, how do I get only the string "Ola" using XPath?

Without knowing Scrapy, I would try
//li[text()[contains(.,"string2")]]/span/text()
//li[text()[contains(.,"string2")]]: select an li element that has a text node containing string2
/span: select a span element that is a child of the selected li
/text(): return the text of the selected span element
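Update: This is simpler and should also work here, because in XPath 1.0 contains(text(), ...) tests only the first text node of the li, and that is where "string2" sits: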
//li[contains(text(),"string2")]/span/text()
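As a rough illustration of what the first expression tests, here is a minimal Python sketch. It assumes the markup can be parsed as XML, and because the standard-library ElementTree does not support text() or contains() in its XPath subset, the predicate is spelled out in plain Python. The helper names direct_text_nodes and span_text_where_li_contains are hypothetical, and the sample is wrapped in a <ul> only so that it has a single root.

```python
import xml.etree.ElementTree as ET

# Sample markup modelled on the question (wrapped in <ul> so it has one root).
DOC = """
<ul>
  <li>string1 <span>Hello</span></li>
  <li>string2 <span>Ola</span></li>
</ul>
"""

def direct_text_nodes(element):
    """Return the element's own text nodes, i.e. what XPath text() selects."""
    nodes = [element.text] + [child.tail for child in element]
    return [node for node in nodes if node]

def span_text_where_li_contains(root, needle):
    """Emulate //li[text()[contains(., needle)]]/span/text() for simple markup."""
    results = []
    for li in root.iter("li"):
        # Keep the li if any of its direct text nodes contains the needle.
        if any(needle in node for node in direct_text_nodes(li)):
            # span.text is the text before any child of the span.
            results.extend(span.text for span in li.findall("span") if span.text)
    return results

root = ET.fromstring(DOC)
print(span_text_where_li_contains(root, "string2"))
```

In Scrapy itself the XPath string can be passed to response.xpath() directly; the loop above is only meant to show which nodes the predicate looks at.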

Related

XPath node that doesn't contain a child

I'm trying to access a certain element using XPath, but I just can't seem to get it, and I don't quite understand why.
<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>
Basically, the text I want is the second one, but I want/need to use something similar to p[not(div)] to retrieve it.
I have tried the methods from the following link, but to no avail (xpath find node that does not contain child)
Here is how I tried accessing the text:
ul[contains(@id,"content")]//p[not(.//div)]/text()
If you have any possible answers, thank you!
The HTML snippet posted in the question shows that neither p element contains a div, so the expression //p[not(.//div)] would match both p elements. The first p element is a sibling of the div (both share the same parent element li), not its parent or ancestor. The following XPath expression matches text nodes from the second p and not those from the first:
//ul[contains(@id,"content")]/li[not(div)]/p/text()
Brief explanation:
//ul[contains(@id,"content")]: find ul elements whose id attribute value contains the text "content"
/li[not(div)]: from such ul elements, find child li elements that don't have a child div element. This matches only the last li in the example HTML
/p/text(): from such li elements, find child p elements and return their child text nodes
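The three steps above can be sketched in Python. This is only an approximation under the assumption that the page parses as XML: ElementTree's XPath subset has no not() or contains(), so each step is written as ordinary Python. The function name p_text_in_li_without_div and the <body> wrapper are made up for the example.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <body> root.
DOC = """
<body>
  <ul class="test1" id="content">
    <li class="list">
      <p>Insert random text here</p>
      <div class="author"></div>
    </li>
    <li class="list">
      <p>I need this text here</p>
    </li>
  </ul>
</body>
"""

def p_text_in_li_without_div(root):
    """Emulate //ul[contains(@id,"content")]/li[not(div)]/p/text()."""
    texts = []
    for ul in root.iter("ul"):
        if "content" not in ul.get("id", ""):   # contains(@id, "content")
            continue
        for li in ul.findall("li"):             # child li elements only
            if li.find("div") is None:          # li[not(div)]: no child div
                texts.extend(p.text for p in li.findall("p") if p.text)
    return texts

root = ET.fromstring(DOC)
print(p_text_in_li_without_div(root))
```

The check is made on the li, not on the p, which is the point of the answer: the div is a sibling of the first p, so only a test on their shared parent can tell the two list items apart.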

xpath:how to find a node that not contains text?

I have a html like:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div that does not contain text, but the XPath
//div[@class='grid' and text()='']
doesn't seem to work. And if I don't know the text that the other divs have, how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with @class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[@class='grid' and not(text()[normalize-space(.)])]
You need not() together with normalize-space():
//div[@class='grid' and not(normalize-space(text()))]
or
//div[@class='grid' and normalize-space(text())='']
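To make the "no directly-contained non-whitespace text" requirement concrete, here is a small Python sketch. It assumes XML-parseable markup and approximates normalize-space() with str.strip(), since ElementTree cannot evaluate the predicate itself. The id attributes and the helper names has_direct_text and grid_divs_without_text are invented for the example, only to make the result readable.

```python
import xml.etree.ElementTree as ET

# The question's markup; the id attributes are hypothetical labels.
DOC = """
<body>
  <div class="grid" id="with-text">
    abc
    <span class="searchMatch">def</span>
  </div>
  <div class="grid" id="without-text">
    <span class="searchMatch">def</span>
  </div>
</body>
"""

def has_direct_text(element):
    """True if any of the element's own text nodes is non-whitespace."""
    nodes = [element.text] + [child.tail for child in element]
    return any(node and node.strip() for node in nodes)

def grid_divs_without_text(root):
    """Approximate //div[@class='grid' and not(text()[normalize-space(.)])]."""
    return [div for div in root.iter("div")
            if div.get("class") == "grid" and not has_direct_text(div)]

root = ET.fromstring(DOC)
print([div.get("id") for div in grid_divs_without_text(root)])
```

Text inside the child <span> is ignored on purpose: only the text nodes that belong to the div itself are inspected, which is why text()='' in the original attempt was not the right test.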

Xpath identifier for an span element with an 'i' tag inside it

I am trying to find an XPath for the element in the HTML below.
XPath works with
//span[@title='Open']
but not with
//span[text()='Open']
I would like an XPath that uses span and its text. How can this be done?
<span class="m-t-5">
<span class="label statusopen" title="Open" onclick="javascript:toggleHistoryTab(this,'tw_831158485664530432','2302','SOLR','-1','s360-379359269');" style="cursor: pointer; text-decoration: none;">
<i class="fa fa-envelope-open-o"/>
Open
</span>
</span>
In my test case, it is more appropriate to find the element by its text, "Open".
This XPath,
//span[normalize-space()='Open'][not(.//span)]
will select those span elements whose normalized string value is "Open" and will exclude parent span elements such as the one in your example with class="m-t-5".
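The reason the extra [not(.//span)] predicate is needed can be seen in a short Python sketch. It assumes the snippet parses as XML (the attributes are trimmed for brevity), and it approximates normalize-space() by joining all descendant text and collapsing whitespace. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the question's markup.
DOC = """
<span class="m-t-5">
  <span class="label statusopen" title="Open">
    <i class="fa fa-envelope-open-o"/>
    Open
  </span>
</span>
"""

def normalized_string_value(element):
    """Approximate normalize-space(.): all descendant text, whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())

def innermost_spans_with_text(root, wanted):
    """Emulate //span[normalize-space()=wanted][not(.//span)]."""
    return [span for span in root.iter("span")
            if normalized_string_value(span) == wanted
            and span.find(".//span") is None]   # no descendant span

root = ET.fromstring(DOC)
print([span.get("class") for span in innermost_spans_with_text(root, "Open")])
```

Both spans have the normalized string value "Open", because the outer span's string value includes the text of its descendants; the second condition is what keeps only the inner one.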
If you need an XPath for a span element with an i tag inside it, you may try:
//span[i]
If you need a more specific XPath for a span element containing an i tag whose text is "Open":
//span[i[normalize-space(text())="Open"]]

xpath for locating li with text does not work

Using the XPath //ul//li[contains(text(),"outer")] to find an li in the outer ul does not work:
<ul>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 1
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 2
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
</ul>
Any idea how to find an li with a specific text in the outer ul?
Thank you
This will work for you: //ul//li[contains(.,"outer")]
I expect you only want to consider the text nodes that are direct children of the li, so you are right to use text() (contains(.,"outer") also considers text from any descendant of the li).
Therefore try this:
//ul/li[text()[contains(.,'outer')]]
Running this with Saxon, the original XPath expression gives:
XPTY0004: A sequence of more than one item is not allowed as the first argument of
contains() ("", "", ...)
Now, I guess Selenium is probably using XPath 1.0 rather than XPath 2.0, and in 1.0 the contains() function has "first item semantics" - it converts its argument to a string, which if the argument is a node-set containing more than one node, involves considering only the first node. And the first text node is probably whitespace.
If you want to test whether some child text node contains "outer", use
//ul//li[text()[contains(.,"outer")]]
Another reason for switching to XPath 2.0...
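The "first item semantics" described above can be made visible with a few lines of Python. This sketch assumes the list parses as XML and uses the standard-library ElementTree only to list the li's direct text nodes; the two membership tests then mimic what the two XPath forms look at. The helper name direct_text_nodes is hypothetical.

```python
import xml.etree.ElementTree as ET

# One outer li from the question's markup.
DOC = """
<ul>
  <li>
    <span> not unique text, </span>
    <span> not unique text, </span>
    outer ul li 1
    <ul>
      <li> inner ul li 1 </li>
    </ul>
  </li>
</ul>
"""

def direct_text_nodes(element):
    """Return the element's own text nodes, in document order."""
    nodes = [element.text] + [child.tail for child in element]
    return [node for node in nodes if node]

root = ET.fromstring(DOC)
outer_li = root.find("li")
nodes = direct_text_nodes(outer_li)

# XPath 1.0 contains(text(), "outer"): only the first text node is tested,
# and here that node is the whitespace before the first span.
first_only = "outer" in nodes[0]

# text()[contains(., "outer")]: every direct text node is tested in turn.
any_node = any("outer" in node for node in nodes)

print(first_only, any_node)
```

The first test fails and the second succeeds, which matches the behaviour reported in the question and the fix proposed in the answer.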
For the above issue, this solution will work:
//ul//li[contains(.,"outer")]
"." selects the current node, so contains() tests the whole string value of the li, including the text of its descendants.

How to extract inner text of multiple Paragraph tags which are nested withing an anchor tag

Here is the code:
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
Using XPath, I want to extract the inner text of all the paragraph tags contained inside the anchor tag as a single text entity.
The final result I want is
string letterBody = document.DocumentNode.SelectSingleNode("//XPATH QUERY").InnerText;
where letterBody="Dear Sir, This is with...................Regards."
You just need to select the <a> element; its inner text includes all the text nodes under <a>.
So your XPath would be /a[@id='Letter1'] or just /a (use //a[@id='Letter1'] if the anchor is not at the top level of the document).
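The same idea, selecting the anchor and reading all text beneath it, can be sketched in Python with the standard library. This is a stand-in for the C# call in the question, not its API: it assumes the markup parses as XML, the <body> wrapper and the inner_text helper are hypothetical, and whitespace between paragraphs is collapsed to single spaces.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <body> root.
DOC = """
<body>
  <a id="Letter1">
    <p>Dear Sir, </p>
    <p>This is with.........</p>
    <p>I would be.......</p>
    <p>Hoping to hear from you soon</p>
    <p>Regards.</p>
  </a>
</body>
"""

def inner_text(element):
    """Join every text node under the element and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())

root = ET.fromstring(DOC)
anchor = root.find(".//a[@id='Letter1']")   # select the anchor by its id
letter_body = inner_text(anchor)
print(letter_body)
```

No per-paragraph query is needed: once the <a> element is selected, its text content already spans all the nested <p> elements.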
