xpath for locating li with text does not work - xpath

Using the xpath //ul//li[contains(text(),"outer")] to find a li in the outer ul does not work
<ul>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 1
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 2
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
</ul>
Any idea how to find a li with a specific text in the outer ul?
Thank you

This will work for you //ul//li[contains(.,"outer")]

I assume you only want to consider the text nodes that are direct children of the li, so you are right to use text() (contains(.,"outer") would also consider text from any descendant of the li).
Therefore try this:
//ul/li[text()[contains(.,'outer')]]

Running this with Saxon, the original XPath expression gives:
XPTY0004: A sequence of more than one item is not allowed as the first argument of
contains() ("", "", ...)
Now, I guess Selenium is probably using XPath 1.0 rather than XPath 2.0, and in 1.0 the contains() function has "first item semantics" - it converts its argument to a string, which if the argument is a node-set containing more than one node, involves considering only the first node. And the first text node is probably whitespace.
If you want to test whether some child text node contains "outer", use
//ul//li[text()[contains(.,"outer")]]
Another reason for switching to XPath 2.0...
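To make the "first item semantics" concrete, here is a minimal sketch in Python. It assumes the standard-library xml.etree.ElementTree parser, whose XPath subset has no contains(), so the three predicates are emulated in plain Python; the helper names (child_text_nodes and so on) are made up for illustration, and the markup is a shortened copy of the question's.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup (one outer li only).
HTML = """<ul>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 1
<ul>
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
</ul>"""


def child_text_nodes(elem):
    """Text nodes that are direct children of elem, in document order."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def first_text_contains(elem, needle):
    """Emulates XPath 1.0 contains(text(), needle): only the first text node counts."""
    nodes = child_text_nodes(elem)
    return needle in (nodes[0] if nodes else "")


def any_text_contains(elem, needle):
    """Emulates text()[contains(., needle)]: any child text node may match."""
    return any(needle in t for t in child_text_nodes(elem))


def string_value_contains(elem, needle):
    """Emulates contains(., needle): the string value includes descendant text."""
    return needle in "".join(elem.itertext())


outer_li = ET.fromstring(HTML).find("li")
print(repr(child_text_nodes(outer_li)[0]))       # the first text node is whitespace only
print(first_text_contains(outer_li, "outer"))    # False
print(any_text_contains(outer_li, "outer"))      # True
print(string_value_contains(outer_li, "inner"))  # True: descendant text counts too
print(any_text_contains(outer_li, "inner"))      # False
```

The last two lines show the trade-off between the answers on this page: contains(.,"outer") is the simplest fix, but it also sees text from the nested ul.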

For above issue -
This solution will work
//ul//li[contains(.,"outer")]
"." Selects the current node

Related

XPath node that doesn't contain a child

I'm trying to access a certain element using XPath, but I just can't seem to get it, and I don't quite understand why.
<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>
Basically the text I want is the second one, but I want/need to use something similar to p[not(div)] to retrieve it.
I have tried the methods from the following link, but to no avail (xpath find node that does not contain child)
Here is how I tried accessing the text:
ul[contains(@id,"content")]//p[not(.//div)]/text()
If you have any possible answers, thank you !
The HTML snippet posted in the question shows that neither p element contains a div, so the expression //p[not(.//div)] would match both p elements. The first p element is a sibling of the div (both share the same parent element li), not its parent or ancestor. The following XPath expression would match text nodes from the second p and not those from the first one:
//ul[contains(@id,"content")]/li[not(div)]/p/text()
Brief explanation:
//ul[contains(@id,"content")]: find ul elements whose id attribute value contains the text "content"
/li[not(div)]: from such ul, find child elements li that don't have a child element div. This will match only the last li in the example HTML
/p/text(): from such li, find child elements p and then return the child text nodes from such p
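The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree, whose XPath subset supports neither not() nor contains(), so the id test is an exact match and the not(div) test is done in plain Python. The body wrapper element is added just so the ul is not the root.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a <body> element for this sketch.
HTML = """<body>
<ul class="test1" id="content">
<li class="list">
<p>Insert random text here</p>
<div class="author">
</div>
</li>
<li class="list">
<p>I need this text here</p>
</li>
</ul>
</body>"""

root = ET.fromstring(HTML)

texts = []
# //ul[@id="content"]: exact match here, where the XPath above uses contains()
for ul in root.findall(".//ul[@id='content']"):
    # /li[not(div)]: keep only li children that have no div child
    for li in ul.findall("li"):
        if li.find("div") is None:
            # /p/text(): the leading text of each p child
            texts.extend(p.text for p in li.findall("p") if p.text)

print(texts)  # ['I need this text here']
```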

scrapy and xpath: get the text in a child element, if the parent element contains text

How do I get the text of a child element, if the parent element contains text with a specific string?
For example:
<li>
"string1"
<span>
"Hello"
</span>
</li>
<li>
"string2"
<span>
"Ola"
</span>
</li>
From the above html code, how to get only string "Ola" using xpath?
Without knowing scrapy, I would try
//li[text()[contains(.,"string2")]]/span/text()
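//li[text()[contains(.,"string2")]]: selects each li element that has a child text node containing string2
/span: selects the span child elements of the selected li
/text(): returns the text nodes of the selected span element
Update: This is simpler and should also work here, because string2 sits in the first text node of the li (in XPath 1.0, contains(text(), ...) only looks at the first text node):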
//li[contains(text(),"string2")]/span/text()

Extracting contents from a list split across different divs

Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each section header. However, inconveniently, the designer of the webpage decided to break up the data into three columns, adding an additional div (with classes column, right, etc.).
My current method of extraction uses XPath.
For the section headers, I use this XPath (get all h1 elements within a div with the given id):
//div[@id="relevantID"]//h1
This returns a list of h1 elements. Looping over them, I apply an additional selector to each matched h1 element: look up the following ul nodes and retrieve all their li nodes.
following-sibling::ul//li
But thanks to the designer's aesthetics, I am failing in the one particular case I've marked in the HTML, where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the HTML to make a selector match is considered good practice (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not necessary, hints/tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axis if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
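The same document-order idea can be sketched outside XPath. The following is a minimal illustration, assuming Python's standard-library xml.etree.ElementTree and a shortened copy of the question's markup; items_by_section is a made-up helper name. It walks the tree once in document order and files each li under the most recently seen h1, so the column divs no longer matter.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup; the second column still
# starts with a ul that belongs to Section-Header-1.
HTML = """<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul><li>item1a</li><li>item1b</li></ul>
</div>
<div class="column">
<ul><li>item1e</li><li>item1f</li></ul>
<h1> Section-Header-2 </h1>
<ul><li>item2a</li><li>item2b</li></ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul><li>item3a</li></ul>
</div>
</div>"""


def items_by_section(root):
    """Group li texts under the nearest preceding h1, in document order."""
    sections = {}
    current = None
    for elem in root.iter():  # iter() yields elements in document order
        if elem.tag == "h1":
            current = " ".join((elem.text or "").split())  # like normalize-space()
            sections[current] = []
        elif elem.tag == "li" and current is not None:
            sections[current].append((elem.text or "").strip())
    return sections


sections = items_by_section(ET.fromstring(HTML))
print(sections["Section-Header-1"])  # ['item1a', 'item1b', 'item1e', 'item1f']
print(sections["Section-Header-2"])  # ['item2a', 'item2b']
```

One caveat when adapting the XPath above to the other sections: in XPath 1.0, normalize-space(preceding::h1) reads the first h1 in document order, which is always Section-Header-1, so for later sections preceding::h1[1] (the nearest preceding h1) would be needed instead.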
You can combine following-sibling and preceding-sibling with the union operator | to also pick up li elements that sit before the h1 in the same div. For example, for the second h1:
((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 elements using //div[@id="relevantID"]//h1 and, as a second step, retrieving all li items for each h1 using following-sibling::ul//li, you could change that second step to following-sibling::ul//li | preceding-sibling::ul//li.

xpath translate two expressions

I can't figure out two expressions in xpath. Can someone help ?
Here they are
substring-after(substring-before(//ul[@id='biblio']/li[3], ']'), '[')
//h2[normalize-space(string())='name']/preceding::h1[1]
Your first expression:
substring-after(substring-before(//ul[@id='biblio']/li[3], ']'), '[')
First, this finds all ul elements anywhere in the document. These must have an id attribute with the value 'biblio' to be matched; from there it selects the third li child element of each matching ul element.
It will then apply the substring functions to the string value of the li element: substring-before keeps everything before the first ], and substring-after then keeps everything after the first [.
So, for example, if the text of a matched li element was hello [world], you would end up with just world as the result. As a more complete example, given the XML input:
<div>
<ul id="biblio">
<li>thing [one]</li>
<li>thing [two]</li>
<li>thing [three]</li>
</ul>
<ul id="biblio">
<li>other [a]</li>
<li>other [b]</li>
<li>other [c]</li>
</ul>
</div>
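Two li elements match here, one per ul. In XPath 1.0, substring-before converts that node-set to a string using only its first node in document order, so the result is the single string three. In XPath 2.0, passing more than one item to substring-before is a type error (the same XPTY0004 error that Saxon reports for contains() above), so the expression only works when exactly one li matches. Note that the use of <div> in the example input is just a container and could be any element.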
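As a rough illustration of what the two nested functions do to a single string, here is a sketch in Python. The functions substring_before and substring_after are hand-written stand-ins for the XPath functions (they are not part of any library), built on str.partition.

```python
def substring_before(text, sep):
    """Stand-in for XPath substring-before(): text before the first sep, or "" if absent."""
    if sep == "":
        return ""  # XPath returns "" for an empty second argument
    head, found, _ = text.partition(sep)
    return head if found else ""


def substring_after(text, sep):
    """Stand-in for XPath substring-after(): text after the first sep, or "" if absent."""
    if sep == "":
        return text  # XPath returns the whole string for an empty second argument
    _, found, tail = text.partition(sep)
    return tail if found else ""


li_text = "thing [three]"                 # string value of the third li
step1 = substring_before(li_text, "]")    # 'thing [three'
result = substring_after(step1, "[")      # 'three'
print(result)
```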
Your second expression:
//h2[normalize-space(string())='name']/preceding::h1[1]
First, this finds all the h2 elements anywhere in the document. Their string value, with leading and trailing whitespace stripped and inner whitespace collapsed by normalize-space, must be equal to name. From each such h2 you then select the nearest preceding h1.
So for example, given the XML input:
<div>
<h1>title1</h1>
<p>stuff</p>
<h1>title2</h1>
<p>more stuff</p>
<h2>name</h2>
<p>other stuff</p>
</div>
You would get the following XML output as a result of your XPath expression:
<h1>title2</h1>
Hope that helps you understand...

How to get the preceding element?

<p class="small" style="margin: 16px 4px 8px;">
<b>
<a class="menu-root" href="#pg-jump">Pages</a>
:
<b>1</b>
,
<a class="pg" href="viewforum.php?f=941&start=50">2</a>
,
<a class="pg" href="viewforum.php?f=941&start=100">3</a>
...
<a class="pg" href="viewforum.php?f=941&start=8400">169</a>
,
<a class="pg" href="viewforum.php?f=941&start=8450">170</a>
,
<a class="pg" href="viewforum.php?f=941&start=8500">171</a>
<a class="pg" href="viewforum.php?f=941&start=50">Next.</a>
</b>
</p>
I want to select the a element containing 171, so basically the element preceding the Next. link.
//a[.='Next.'] (not sure how to use preceding here)
You can use this xpath:
//a[.="Next."]/preceding::a[1]
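To see what preceding::a[1] picks, here is a small sketch in Python. It assumes the standard-library xml.etree.ElementTree, which has no preceding axis, so the step is emulated by listing the a elements in document order and taking the one just before "Next."; that shortcut is only valid because a elements are not nested inside each other here. The markup is a shortened copy of the question's, with the ampersands in the href values escaped as &amp; so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup, with & escaped for the XML parser.
HTML = """<p class="small">
<b>
<a class="menu-root" href="#pg-jump">Pages</a>
:
<b>1</b>
,
<a class="pg" href="viewforum.php?f=941&amp;start=50">2</a>
...
<a class="pg" href="viewforum.php?f=941&amp;start=8500">171</a>
<a class="pg" href="viewforum.php?f=941&amp;start=50">Next.</a>
</b>
</p>"""

root = ET.fromstring(HTML)

links = list(root.iter("a"))  # every a element, in document order
# //a[.="Next."]: index of the link whose text is exactly "Next."
next_idx = next(i for i, a in enumerate(links) if (a.text or "") == "Next.")
# /preceding::a[1]: the nearest a element before it
previous = links[next_idx - 1] if next_idx > 0 else None

print(previous.text)  # 171
```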
If I were to diagram it out, using an X to represent the current location, it would look like this:
------------------+------+------------------
preceding-sibling | self | following-sibling
------------------|------|------------------
last() ... 2 1 | X | 1 2 ... last()
------------------+------+------------------
//a[contains(text(), 'Next.')]/preceding::a[contains(text(), '171')]
Explanation of the XPath: select the a element whose text contains 'Next.', then use the preceding axis to locate the a element whose text contains '171'.
I know this is old, and if you don't know the element that contains the "Next." link, this won't be a solution for you. But if you want to find exactly that element and there are several "171" elements all over the page,
you can distinguish it from the rest with the following:
//p[b[contains(., 'Next.')]]//a[contains(., '171')]

Resources