Extracting text in between nodes through XPath

I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...
<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>
I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:
//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]
However, I cannot hardcode the ending item name because the order of the items differs in the pages I want to scrape (e.g. 'First item' may be followed by 'Third item').
Any help on how to adapt my XPath query would be greatly appreciated.

Found it!
//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]
Indeed your solution, Aleh, won't work for tags inside the text.
Now, the one remaining case is the last item, which is not followed by an element with class=header, so the expression will include all text up to the end of the document. Ideas?

//*[@class='header' and contains(text(),'First item')]/following::text()[1] will select the first text node after <div class="header">First item</div>.
//*[@class='header' and contains(text(),'Second item')]/following::text()[1] will select the first text node after <div class="header">Second item</div>, and so on.
EDIT: Sorry, this will not work for <strong> cases. Will update my answer
EDIT2: Used @Michiel's part. Looks like omg but works: //div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]
Seems that this should be solved with a better solution :)

For the sake of completeness, the final query, composed of various suggestions throughout the thread:
//*[
  @class='textfield' and position() = 1
]
//text()[
  preceding::*[
    @class='header' and contains(text(),'First item')
  ]
][
  following::*[
    preceding::*[
      @class='header'
    ][1][
      contains(text(),'First item')
    ]
  ]
]
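The last-item case raised earlier in the thread can also be handled by doing the grouping in code rather than in one expression. Below is a minimal Python sketch, not part of the original thread, using the standard library's xml.etree.ElementTree. It assumes the markup can be parsed as well-formed XML (a real page would likely need an HTML parser first), and the helper name items_by_header is made up for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>"""


def items_by_header(container):
    """Map each header's text to the text between it and the next header."""
    items = {}
    current = None  # header whose body is currently being collected
    for child in container:
        if child.get("class") == "header":
            current = "".join(child.itertext()).strip()
            items[current] = []
        elif current is not None:
            # Inline element such as <strong> or <span>: keep its text.
            items[current].append("".join(child.itertext()))
        # Text that follows this child (its "tail") belongs to the current item.
        if current is not None and child.tail:
            items[current].append(child.tail)
    # Join the pieces and collapse runs of whitespace.
    return {name: " ".join("".join(parts).split()) for name, parts in items.items()}


root = ET.fromstring(PAGE)
first_textfield = root.find("div[@class='textfield']")  # first match only
result = items_by_header(first_textfield)
```

Because the walk ends with the children of the first textfield div, the last item stops there and the footer text in the second div is never picked up.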

Related

Please help to extract date using xpath

<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>
Hello, I want to extract the "posted this 4 minutes ago" or just "4 minutes" using xpath. Can anybody help me? Thank you
The div whose class equals postbodytop contains three child nodes (ignoring whitespace): an a element, a text node, and a span. Your path should start at the div and then select the child text node, for which the appropriate test is text().
div/text()
Of course this is just a fragment of a bigger page, so your XPath may need something at the start, e.g. /html/body/ etc. If there are other div elements at the same level as the <div class="postbodytop">, be more specific about the div, e.g. div[@class="postbodytop"] instead of just div in that XPath expression.
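To get from the text node to just "4 minutes", some string cleanup is still needed after the selection step. A rough Python sketch with the standard library follows; it is an illustration, not something from the answer above. ElementTree does not evaluate text() in paths, so the text node is read as the tail of the a element instead, and the regular expression is an assumption about how the timestamp is worded.

```python
import re
import xml.etree.ElementTree as ET

FRAGMENT = """<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>"""

div = ET.fromstring(FRAGMENT)
# The text sits right after </a>, so ElementTree stores it as the link's tail.
raw = div.find("a").tail
# Drop surrounding whitespace and the literal double quotes around the text.
posted = raw.strip().strip('"').strip()
# Assumed wording: "<number> <unit> ago".
match = re.search(r"(\d+\s+\w+)\s+ago", posted)
age = match.group(1) if match else None
```

If the wording differs (for example "yesterday"), the pattern simply does not match and age stays None.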

How to get descendants with a specific tag name and text in protractor?

I have the following structure (it's just a sample). In protractor, I am getting the top element by id. However, the other elements do not have ids. I need to get the "label" element that contains the text '20'. Is there an easy way in protractor to select the element with a specific tag that contains a specific text from all the descendants of a parent element?
<pc-selector _... id="Number1">
<div ...></div>
<div ...>
<div ...>
<check-box _...>
<div _ngcontent-c25="" ...>
<label _ngcontent-c25="">
<input _ngcontent-c25="" type="checkbox">
<span _ngcontent-c25="" class="m-checkbox__marker"></span>
20 More text to follow</label>
</div>
</check-box>
</div>
</div>
</pc-selector>
I couldn't find anything, so I have tried with xpath, but protractor complains that my xpath is invalid:
parentElement = element(by.id('Number1'));
return parentElement.element(by.xpath(".//label[contains(text(),'20'))]"));
Any ideas?
You have an extra closing parenthesis in your [contains(text(),'20'))], which is likely causing the error, but there are several other ways to achieve this, using a single XPath or chaining other locators.
The process is that you must find the element with the correct id first and then locate the label that is a descendant of it.
//Xpath
element(by.xpath("//pc-selector[@id='Number1']//label[contains(text(),'20')]"));
//Chained CSS
element(by.id('Number1')).element(by.cssContainingText('label','20'));
You also may be interested to learn about xpath axes which can allow us to do very dynamic selection.
You can use the direct xpath to access the label.
element(by.xpath("//*[@id='Number1']//label"));
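One thing to watch for with contains(text(),'20'): in XPath 1.0, a node-set passed to contains() is converted to a string using only its first node, and in the sample as posted the label's first text node is just the whitespace before <input>. contains(., '20') tests the label's whole string value instead. The Python sketch below (standard library only, with a simplified copy of the markup in which the elided attributes are dropped) shows the two values side by side. Whether the real DOM contains that whitespace is an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed copy of the sample; elided attributes are left out.
SAMPLE = """<pc-selector id="Number1">
<div><div><check-box><div>
<label>
<input type="checkbox"/>
<span class="m-checkbox__marker"></span>
20 More text to follow</label>
</div></check-box></div></div>
</pc-selector>"""

root = ET.fromstring(SAMPLE)
label = root.find(".//label")

# First text node of the label: what contains(text(), '20') would look at.
first_text_node = label.text
# All descendant text in document order: what contains(., '20') would look at.
string_value = "".join(label.itertext())

matches_first_node = "20" in first_text_node
matches_string_value = "20" in string_value
```

With this markup only the second check finds '20', which is also why the cssContainingText locator, which looks at the element's text as a whole, is a reasonable alternative.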

How to get the whole title which consists of several spans with XPATH?

How to get the whole title:
Iphone case :) #phonecases#xmas#iphone#case
When the title does not include hashtags I can get the whole title with this xpath:
((//*[@class='pinWrapper'])[2]//span)[1]/text()
This line:
((//*[@class='pinWrapper'])[2]//span)[1]//text()[normalize-space()]
returns only the first one: Iphone case :).
And this:
((//*[@class='pinWrapper'])[2]//span)[1][string()]
returns whole xml:
<span>Iphone case :) <span class="pinHashtag">#phonecases</span> <span class="pinHashtag">#xmas</span> <span class="pinHashtag">#iphone</span> <span class="pinHashtag">#case</span></span>
If ((//*[@class='pinWrapper'])[2]//span)[1]/text() returns only the first text node, try
string(((//*[@class='pinWrapper'])[2]//span)[1])
to get the complete string.
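The difference is that string(...) as a function returns the string value of the element, i.e. all descendant text concatenated, whereas [string()] is only a predicate that filters the span and still returns the element. As a rough illustration outside XPath, here is a Python sketch with the standard library, where itertext() plays the role of string(); the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

TITLE = (
    '<span>Iphone case :) <span class="pinHashtag">#phonecases</span> '
    '<span class="pinHashtag">#xmas</span> <span class="pinHashtag">#iphone</span> '
    '<span class="pinHashtag">#case</span></span>'
)

span = ET.fromstring(TITLE)

# Only the text before the first nested span, like taking the first text() node.
first_text_node = span.text
# All descendant text in document order, like string(span) in XPath.
full_title = "".join(span.itertext())
```

Here full_title keeps the spaces that sit between the nested spans in the markup.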

Xpath get element above

suppose I have this structure:
<div class="a" attribute="foo">
<div class="b">
<span>Text Example</span>
</div>
</div>
In xpath, I would like to retrieve the value of the attribute "attribute" given I have the text inside: Text Example
If I use this xpath:
.//*[@class='a']//*[text()='Text Example']
It returns the element span, but I need the div.a, because I need to get the value of the attribute through Selenium WebDriver
There are several ways to do this.
If the text Text Example is given, you can locate the element through it:
//span[text()='Text Example']/../.. --> if you know it is 2 levels up
OR
//span[text()='Text Example']/ancestor::div[@class='a'] --> if you don't know how many levels up this `div` is
Use the two XPaths above if you want to identify the element through the text Text Example. If you don't need to go through the text, you can identify it directly:
//div[@class='a']
You already mentioned the answer in your question:
but I need the div.a,
Try this:
driver.findElement(By.cssSelector("div.a")).getAttribute("attribute");
Use cssSelector for best results.
Or else try the following XPath:
//div[contains(@class, 'a')]
If you want the attribute of the div.a whose descendant span contains the given text, try the following:
driver.findElement(By.xpath("//div[@class = 'a' and descendant::span[text() = 'Text Example']]")).getAttribute("attribute");
Hope it helps..:)
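For readers who want to see what the ancestor::div[@class='a'] step does, here is a small Python sketch of the same idea done by hand with the standard library: start at the span with the known text and climb until a matching div is found. It is only an illustration and not Selenium code; the <body> wrapper and the variable names are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The <body> wrapper is added only so the fragment has a parent element.
MARKUP = """<body>
<div class="a" attribute="foo">
<div class="b">
<span>Text Example</span>
</div>
</div>
</body>"""

root = ET.fromstring(MARKUP)
# ElementTree elements do not know their parent, so build a child -> parent map.
parents = {child: parent for parent in root.iter() for child in parent}

# Start from the span that holds the known text.
node = next(e for e in root.iter("span") if e.text == "Text Example")
# Climb until a div with class "a" is reached (or there are no ancestors left).
while node is not None and not (node.tag == "div" and node.get("class") == "a"):
    node = parents.get(node)

value = node.get("attribute") if node is not None else None
```

The loop does not depend on how many levels separate the span from the div, which is the same advantage the ancestor axis has over a fixed /../.. path.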

XPath / XQuery: find text in a node, but ignoring content of specific descendant elements

I am trying to find a way to search for a string within nodes, but excluding the content of some subelements of those nodes. Plain and simple, I want to search for a string in paragraphs of a text, excluding the footnotes which are children elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
I want to search for a string in
paragraphs of a text, excluding the
footnotes which are children elements
of the paragraphs
An XPath 1.0 - only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (based on yours, with a p added inside a footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
With XPath 2.0, which has the except operator:
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]
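The same "ignore the footnotes" rule can also be written as a short tree walk, which may help when checking what the expressions above should return. This is a minimal Python sketch with the standard library, not taken from any of the answers. The helper name text_outside is hypothetical, and unlike the per-text-node XPath tests it searches the joined text of each paragraph.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>"""


def text_outside(elem, skip_tag):
    """Yield the text of elem and its descendants, skipping skip_tag subtrees."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tag != skip_tag:
            yield from text_outside(child, skip_tag)
        # Text after the child's closing tag still belongs to elem.
        if child.tail:
            yield child.tail


doc = ET.fromstring(DOC)
# Only the top-level paragraphs, as in /document/p.
matches = [
    p.get("n")
    for p in doc.findall("p")
    if "text" in "".join(text_outside(p, "footnote"))
]
```

Only the first paragraph is reported here, because the word "text" in the second one appears solely inside its footnote.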
