Get tags based on the next following-sibling only (XPath)

I have HTML like this, and I want to get only those <p> tags that have the next sibling <ul> only.
<div>
<p>1</p>
<p>2</p>
<ul>...</ul>
<p>3</p>
<ul>...</ul>
</div>
In the above example, I only want the XPath to return the second and third <p> tags, not the first one. I have tried using following-sibling, but that didn't work out.

This XPath selects each p whose immediately following sibling element is a ul:
//p[./following-sibling::*[position()=1][name()="ul"]]
or
//p[./following-sibling::*[position()=1 and name()="ul"]]
Testing on command line
xmllint --html --recover --xpath '//p[./following-sibling::*[position()=1][name()="ul"]]' test.html
Result
<p>2</p><p>3</p>
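The name function returns a string representing the QName of the first node in a given node-set.
https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/name
Because of this, the position()=1 test can be avoided by passing the node-set to name() directly, as in //p[name(following-sibling::*)="ul"]. Simply dropping position()=1 from the predicates above would not be equivalent: inside a predicate, name() tests each following sibling in turn, so the first p would match as well.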
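To make the "immediately following sibling" condition concrete, here is a minimal Python sketch of the same logic. It assumes the standard-library xml.etree.ElementTree module, whose XPath subset has no following-sibling axis, so the siblings are walked by hand. The helper name p_followed_by_ul is made up for illustration, the elided <ul> content is left empty, and only direct children of the given parent are checked (unlike //p, which searches the whole document).

```python
import xml.etree.ElementTree as ET

# Sample markup from the question; the elided <ul> content is left empty.
HTML = "<div><p>1</p><p>2</p><ul></ul><p>3</p><ul></ul></div>"


def p_followed_by_ul(parent):
    """Return the text of each <p> child whose next sibling element is a <ul>."""
    children = list(parent)
    matches = []
    # Pair every child with the sibling element that directly follows it.
    for current, following in zip(children, children[1:]):
        if current.tag == "p" and following.tag == "ul":
            matches.append(current.text)
    return matches


root = ET.fromstring(HTML)
print(p_followed_by_ul(root))
```

The first <p> is skipped because its next sibling is another <p>, which is the behaviour the position()=1 test enforces in the XPath version.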

Related

xPath - Why is this exact text selector not working with the data test id?

I have a block of code like so:
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
I'm trying to select a menu item based on exact text like so in the dev tools:
$x('.//*[contains(@data-testid, "menu-item") and normalize-space() = "Text"]');
But this doesn't seem to be selecting the element. However, when I do:
$x('.//*[contains(@data-testid, "menu-item")]');
I can see both of the menu items.
UPDATE:
It seems that this works:
$x('.//*[contains(@class, "menu-item") and normalize-space() = "Text"]');
Not sure why using a class in this context works and not a data-testid. How can I get my xpath selector to work with my data-testid?
Why is this exact text selector not working
The fact that both li elements are matched by the XPath expression
if omitting the condition normalize-space() = "Text" is a clue.
normalize-space() returns ... Text Here ... for the first li
in the posted XML and ... Text ... for the second (or some other
content in place of ... from div/svg or div/small) causing
normalize-space() = "Text" to fail.
In an update you say the same condition succeeds. This has nothing to
do with using @class instead of @data-testid; it must be triggered
by some content change.
How can I get my xpath selector to work with my data-testid?
By testing for an exact text match in the li's descendant strong
element,
.//*[@data-testid = "menu-item" and div/strong = "Text"]
which matches the second li. Making the test more robust is usually
in order, e.g.
.//*[contains(@data-testid,"menu-item") and normalize-space(div/strong) = "Text"]
Append /div/small or /descendant::small, for example, to the XPath
expression to extract just the small text.
data-testid="menu-item" matches both of the outer li elements, while the text content you are looking for is inside the inner strong element.
So, to locate the outer li element based on its data-testid attribute value and its inner strong element's text value, you can use an XPath expression like this:
//*[contains(@data-testid, "menu-item") and .//normalize-space() = "Text"]
Or
.//*[contains(@data-testid, "menu-item") and .//*[normalize-space() = "Text"]]
I have tested, both expressions are working correctly
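The difference between testing the whole li and testing only its strong descendant can be sketched in Python. This is only an illustration using the standard-library xml.etree.ElementTree module, not the browser's $x; the helpers normalize_space and string_value are hypothetical stand-ins for the XPath function and for an element's string value, and the "..." placeholders are kept exactly as posted.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with the "..." placeholders kept as posted.
MARKUP = """
<ul class="open-menu">
<span>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text Here</strong>
<small>...</small>
</div>
</li>
<li data-testid="menu-item" class="menu-item option">
<svg>...</svg>
<div>
<strong>Text</strong>
<small>...</small>
</div>
</li>
</span>
</ul>
"""


def normalize_space(text):
    """Rough equivalent of XPath normalize-space() for a Python string."""
    return " ".join((text or "").split())


def string_value(element):
    """Concatenate all descendant text, like an element's XPath string value."""
    return "".join(element.itertext())


root = ET.fromstring(MARKUP)
items = root.findall(".//li[@data-testid='menu-item']")

# Comparing the whole li never yields exactly "Text" ...
whole = [normalize_space(string_value(li)) for li in items]
print(whole)

# ... whereas comparing only div/strong selects the second item.
matches = [li for li in items if normalize_space(li.findtext("div/strong")) == "Text"]
print(len(matches))
```

The first print shows that the string value of each li includes the svg and small content, which is why normalize-space() = "Text" fails on the li itself.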

Xpath get text of nested item not working but css does

I'm making a crawler with Scrapy and wondering why my xpath doesn't work when my CSS selector does? I want to get the number of commits from this html:
<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>
Xpath:
response.xpath('//li[@class="commits"]//a//span[@class="text-emphasized"]//text()').extract()
CSS:
response.css('li.commits a span.text-emphasized').css('::text').extract()
CSS returns the number (unescaped), but XPath returns nothing. Am I using the // for nested elements correctly?
Your predicate compares the whole class attribute of the span tag, which is "num text-emphasized", against "text-emphasized", so nothing matches. Use the contains function to check only that text-emphasized is present:
response.xpath('//li[@class="commits"]//a//span[contains(@class, "text-emphasized")]//text()').extract()[0].strip()
Otherwise, include num as well so that the full attribute value matches:
response.xpath('//li[@class="commits"]//a//span[@class="num text-emphasized"]//text()').extract()[0].strip()
Here extract() turns the result into a list of strings, [0] takes the first one, and strip() removes the surrounding whitespace, leaving just the number.
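The mismatch is easy to reproduce outside Scrapy. The sketch below assumes only the standard-library xml.etree.ElementTree module (not Scrapy's selectors) and compares an exact attribute match against a token-based check that behaves like the CSS class selector; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with the closing </li> tag completed.
MARKUP = """
<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>
"""

root = ET.fromstring(MARKUP)

# An exact comparison against a single class name finds nothing ...
exact_single = root.findall(".//span[@class='text-emphasized']")
# ... while the full attribute value does match.
exact_full = root.findall(".//span[@class='num text-emphasized']")
# Token-based check, similar to what the CSS selector span.text-emphasized does.
by_token = [
    span for span in root.iter("span")
    if "text-emphasized" in span.get("class", "").split()
]

print(len(exact_single), len(exact_full), by_token[0].text.strip())
```

Note that contains(@class, "text-emphasized") is a substring test, so it would also match a class such as not-text-emphasized; the token split above is stricter.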

Xpath fetch specific nodes without their child nodes from XML

I have XML data that looks like this
<priceData>
<div class='price'>
<div class='price-old'>20.00</div>
<div class='price-new'>10.00</div>
<div class='price-tax'>8.00</div>
</div>
<div class='price'>
40.00 <div class='price-tax'>25.00</div>
</div>
</priceData>
I want to use Xpath to extract data for "price-new" from the first price div, and value 40.00 from the second price div. This must be done using single expression.
I tried expressions like
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(contains(@class, '-old'))]
and
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(descendant::div[contains(@class, '-old') and not(contains(@class, '-tax'))]) and not(contains(@class, '-old'))]
and some others but I can't get it to work how it is supposed to.
I always end up with fetching extra nodes from the first case and I only need the single node (price-new or price if there are no more nodes in it).
You can use the XPath union operator (|) to combine two queries into one. Given the markup in the question as XML input, the following XPath (formatted for readability):
//div[@class='price']/div[@class='price-new']/text()
|
//div[@class='price']/text()[normalize-space()]
returned the expected result in an XPath tester:
Text='10.00'
Text='40.00'
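For comparison, the same selection can be sketched in plain Python. This assumes the standard-library xml.etree.ElementTree module, which has no text() node test or union operator, so the two branches become an if/elif. The function name current_prices is hypothetical, and unlike the union it returns at most one value per price div and looks only at the text before the first child element.

```python
import xml.etree.ElementTree as ET

# XML from the question.
XML = """
<priceData>
<div class='price'>
<div class='price-old'>20.00</div>
<div class='price-new'>10.00</div>
<div class='price-tax'>8.00</div>
</div>
<div class='price'>
40.00 <div class='price-tax'>25.00</div>
</div>
</priceData>
"""


def current_prices(root):
    """Return price-new when present, otherwise the price div's own leading text."""
    prices = []
    for price in root.findall(".//div[@class='price']"):
        new = price.find("./div[@class='price-new']")
        if new is not None:
            # First branch of the union: the price-new child.
            prices.append(new.text.strip())
        elif price.text and price.text.strip():
            # Second branch: non-blank text directly inside the price div.
            prices.append(price.text.strip())
    return prices


root = ET.fromstring(XML)
print(current_prices(root))
```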

Xpath - matching based on node() contains() content

I have the following HTML structure (there are many blocks using the same architecture):
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
Now, I want to get the content of <b> using Xpath which I tried like so:
//span[@id="mySpan"]/i/node()[1][contains(text(),"Price")]
which does not match anything. How can I match this using the node()[1] text as anchor?
Regarding the XPath you tried: node()[1] is already a text node, and text() selects text-node children, which a text node does not have. Simply use . instead:
//span[@id="mySpan"]/i/node()[1][contains(.,"Price")]
For the ultimate goal, I'd suggest this XPath:
//span[@id="mySpan"]/i[contains(.,"Price")]/b
or, if you specifically want to match against the first node within <i>:
//span[@id="mySpan"]/i[contains(node(),"Price")]/b
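The last variant, which anchors on the first node inside <i>, can be approximated in Python as follows. This is a sketch assuming the standard-library xml.etree.ElementTree module, where the leading text node of an element is exposed as its .text attribute. The wrapping <div> and the function name price_amounts are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Block from the question, wrapped in a <div> (an assumption) so that the
# span can be found as a descendant of the parsed root.
MARKUP = """
<div>
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
</div>
"""


def price_amounts(root):
    """Return the text of <b> for every <i> whose leading text contains 'Price'."""
    amounts = []
    for i in root.findall(".//span[@id='mySpan']/i"):
        # i.text is the text before the first child, i.e. node()[1] here.
        if "Price" in (i.text or ""):
            b = i.find("b")
            if b is not None:
                amounts.append(" ".join((b.text or "").split()))
    return amounts


root = ET.fromstring(MARKUP)
print(price_amounts(root))
```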

xpath translate two expressions

I can't figure out two expressions in XPath. Can someone help?
Here they are:
substring-after(substring-before(//ul[@id='biblio']/li[3], ']', '['))
//h2[normalize-space(string())='name']/preceding::h1[1]
Your first expression:
substring-after(substring-before(//ul[@id='biblio']/li[3], ']', '['))
First, //ul selects every ul element in the document, at any depth. The predicate [@id='biblio'] keeps only those whose id attribute has the value 'biblio', and /li[3] then selects the third li child of each matching ul.
The substring functions are then applied to the string value of the selected li element(s).
For example, if the text of a matched li element were hello [world], you would end up with just world as the result. As a more complete example, given the XML input:
<div>
<ul id="biblio">
<li>thing [one]</li>
<li>thing [two]</li>
<li>thing [three]</li>
</ul>
<ul id="biblio">
<li>other [a]</li>
<li>other [b]</li>
<li>other [c]</li>
</ul>
</div>
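In XPath 1.0, a node-set passed to a string function is converted using only its first node in document order, so the result of your XPath expression would be the single string three; the third li of the second ul (other [c]) is selected by the path but does not contribute to the result. Note that the <div> in the example input is just a container and could be any element.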
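The two string functions can be mimicked in a few lines of Python to see what each step produces. This sketch assumes the intended nesting is substring-after(substring-before(..., ']'), '['), which is how the explanation above reads the expression. The function names are hypothetical, and a non-empty separator is assumed.

```python
def substring_before(text, sep):
    """Like XPath substring-before: the part before sep, or '' if sep is absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""


def substring_after(text, sep):
    """Like XPath substring-after: the part after sep, or '' if sep is absent."""
    _, found, tail = text.partition(sep)
    return tail if found else ""


def between_brackets(text):
    """Inner call first (cut at ']'), then outer call (keep what follows '[')."""
    return substring_after(substring_before(text, "]"), "[")


print(between_brackets("hello [world]"))
print(between_brackets("thing [three]"))
```

For hello [world], the inner call yields hello [world and the outer call then keeps world.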
Your second expression:
//h2[normalize-space(string())='name']/preceding::h1[1]
First, //h2 selects every h2 element in the document, at any depth. The predicate keeps only those whose string value, after whitespace normalization, is equal to name. From each of those, preceding::h1[1] selects the nearest h1 that comes before it in document order.
So for example, given the XML input:
<div>
<h1>title1</h1>
<p>stuff</p>
<h1>title2</h1>
<p>more stuff</p>
<h2>name</h2>
<p>other stuff</p>
</div>
You would get the following XML output as a result of your XPath expression:
<h1>title2</h1>
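The "nearest preceding h1" step can be sketched in Python by walking the document in order and remembering the last h1 seen. This assumes the standard-library xml.etree.ElementTree module, which has no preceding axis; the function name h1_before_named_h2 is made up, and the sketch ignores the rule that the preceding axis excludes ancestors (an h1 is assumed never to contain the h2).

```python
import xml.etree.ElementTree as ET

# XML input from the example above.
XML = """
<div>
<h1>title1</h1>
<p>stuff</p>
<h1>title2</h1>
<p>more stuff</p>
<h2>name</h2>
<p>other stuff</p>
</div>
"""


def h1_before_named_h2(root, name):
    """For each h2 whose normalized text equals name, return the closest earlier h1 text."""
    results = []
    last_h1 = None
    # iter() visits elements in document order, so last_h1 is always the
    # nearest h1 that started before the current element.
    for element in root.iter():
        if element.tag == "h1":
            last_h1 = element
        elif element.tag == "h2":
            text = " ".join("".join(element.itertext()).split())
            if text == name and last_h1 is not None:
                results.append(last_h1.text)
    return results


root = ET.fromstring(XML)
print(h1_before_named_h2(root, "name"))
```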
Hope that helps you understand...
