Xpath - matching based on node() contains() content - ruby

I have the following HTML structure (there are many blocks using the same architecture):
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
Now, I want to get the content of <b> using Xpath which I tried like so:
//span[@id="mySpan"]/i/node()[1][contains(text(),"Price")]
which does not match anything. How can I match this using the node()[1] text as an anchor?

Regarding the XPath you tried: node()[1] is already a text node, and text() looks for text-node children, which a text node does not have. Use . (the context node itself) instead:
//span[@id="mySpan"]/i/node()[1][contains(.,"Price")]
For the ultimate goal, I'd suggest this XPath:
//span[@id="mySpan"]/i[contains(.,"Price")]/b
or, if you specifically want to match against the first node within <i>:
//span[@id="mySpan"]/i[contains(node(),"Price")]/b
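The same idea can be sketched in the host language. The following is a minimal illustration in Python, assuming a well-formed copy of the snippet wrapped in a hypothetical <body> element. The standard-library xml.etree.ElementTree module only supports a subset of XPath (no contains()), so the "Price" check is done in Python rather than in the path expression.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's markup; the <body> wrapper is an
# assumption added so that the <span> can be found as a descendant.
SAMPLE = """
<body>
  <span id="mySpan">
    <i>
      Price
      <b>
        3 900
        <small>€</small>
      </b>
    </i>
  </span>
</body>
"""

root = ET.fromstring(SAMPLE)

prices = []
for i_elem in root.findall(".//span[@id='mySpan']/i"):
    # .text holds the text that precedes the first child element, which
    # corresponds to node()[1] when <i> starts with a text node.
    leading_text = i_elem.text or ""
    if "Price" in leading_text:
        b_elem = i_elem.find("b")
        if b_elem is not None:
            # .text of <b> is only the text before <small>, so the
            # currency sign is not included.
            prices.append((b_elem.text or "").strip())

print(prices)
```

With a full XPath 1.0 engine the filtering can stay inside the expression, as in the answers above.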

Related

get tags based on the next following-sibling only. xpath

I have HTML like this, and I want to get only those <p> tags that have the next sibling <ul> only.
<div>
<p>1</p>
<p>2</p>
<ul>...</ul>
<p>3</p>
<ul>...</ul>
</div>
In the above example, I only want XPath to return the second and third <p> tag. Not the first one. I have tried using following-sibling but that didn't work out.
This XPath selects each <p> whose immediately following sibling element is a <ul>:
//p[./following-sibling::*[position()=1][name()="ul"]]
or
//p[./following-sibling::*[position()=1 and name()="ul"]]
Testing on command line
xmllint --html --recover --xpath '//p[./following-sibling::*[position()=1][name()="ul"]]' test.html
Result
<p>2</p><p>3</p>
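The name function returns a string representing the QName of the first node in a given node-set; called without an argument, it uses the context node.
https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/name
Note that position()=1 is not redundant: with name()="ul" alone, the predicate would accept any later <ul> sibling, so the first <p> would be selected as well. A shorter equivalent is //p[following-sibling::*[1][self::ul]].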
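To make the difference between "the next sibling is a <ul>" and "some later sibling is a <ul>" concrete, here is a small sketch in Python. It assumes the sample from the question with placeholder list items, and it walks the children in plain Python because the standard-library xml.etree.ElementTree module does not implement the following-sibling axis.

```python
import xml.etree.ElementTree as ET

# The <li> contents are placeholders; the question elides them.
SAMPLE = (
    "<div>"
    "<p>1</p><p>2</p><ul><li>a</li></ul>"
    "<p>3</p><ul><li>b</li></ul>"
    "</div>"
)

root = ET.fromstring(SAMPLE)
children = list(root)

# Equivalent of following-sibling::*[1][self::ul]:
# pair every child with the element right after it.
immediate = [
    cur.text
    for cur, nxt in zip(children, children[1:])
    if cur.tag == "p" and nxt.tag == "ul"
]

# Equivalent of following-sibling::ul with no position test:
# any later sibling counts.
any_later = [
    child.text
    for idx, child in enumerate(children)
    if child.tag == "p" and any(s.tag == "ul" for s in children[idx + 1:])
]

print(immediate)
print(any_later)
```

The first list contains only the second and third paragraphs, while the second list also picks up the first paragraph, which is the case the question wants to avoid.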

How to exclude a child node from xpath?

I have the following code :
<div class = "content">
<table id="detailsTable">...</table>
<div class = "desc">
<p>Some text</p>
</div>
<p>Another text</p>
</div>
I want to select all the text within the 'content' class, which I would get using this XPath:
doc.xpath('string(//div[@class="content"])')
The problem is that it selects all the text, including the text within the 'table' tag. I need to exclude the 'table' from the XPath. How would I achieve that?
XPath 1.0 solutions :
substring-after(string(//div[@class="content"]),string(//div[@class="content"]/table))
Or just use concat :
concat(//table/following::p[1]," ",//table/following::p[2])
The XPath expression //div[@class="content"] selects the div element, nothing more and nothing less, and applying the string() function gives you the string value of the element, which is the concatenation of all its descendant text nodes.
Getting all the text except for that contained in one particular child is probably not possible in a general way in XPath 1.0. With XPath 2.0 it can be done as
string-join(//div[@class="content"]/(node() except table)//text(), '')
But for this kind of manipulation, you're really in the realm of transformation rather than pure selection, so you're stretching the limits of what XPath is designed for.
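If the filtering is moved into the host language, the exclusion becomes a short recursive walk. The sketch below is one possible approach in Python using the standard-library xml.etree.ElementTree module. The helper name text_without, the <body> wrapper, and the table cell content are assumptions made for illustration, and the input is a well-formed copy of the question's markup.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<body>
  <div class="content">
    <table id="detailsTable"><tr><td>Cell text</td></tr></table>
    <div class="desc">
      <p>Some text</p>
    </div>
    <p>Another text</p>
  </div>
</body>
"""


def text_without(element, skipped_tag):
    """Concatenate descendant text, skipping subtrees rooted at skipped_tag."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        if child.tag != skipped_tag:
            parts.append(text_without(child, skipped_tag))
        # The tail belongs to the parent's content, so it is kept even
        # when the child itself is skipped.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


root = ET.fromstring(SAMPLE)
content = root.find(".//div[@class='content']")

# Collapse runs of whitespace so the result is easy to read.
result = " ".join(text_without(content, "table").split())
print(result)
```

Note that this variant skips every <table> at any depth below the starting element, not only direct children.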

Xpath get text of nested item not working but css does

I'm making a crawler with Scrapy and wondering why my xpath doesn't work when my CSS selector does? I want to get the number of commits from this html:
<li class="commits">
<a data-pjax="" href="/samthomson/flot/commits/master">
<span class="octicon octicon-history"></span>
<span class="num text-emphasized">
521
</span>
commits
</a>
</li>
Xpath:
response.xpath('//li[@class="commits"]//a//span[@class="text-emphasized"]//text()').extract()
CSS:
response.css('li.commits a span.text-emphasized').css('::text').extract()
CSS returns the number (unescaped), but XPath returns nothing. Am I using the // for nested elements correctly?
@class="text-emphasized" compares against the entire value of the class attribute, and the span's value is "num text-emphasized", so nothing matches. Use the contains function to check only that text-emphasized is present:
response.xpath('//li[@class="commits"]//a//span[contains(@class, "text-emphasized")]//text()').extract()[0].strip()
Otherwise, include num as well:
response.xpath('//li[@class="commits"]//a//span[@class="num text-emphasized"]//text()').extract()[0].strip()
Also, I use [0] to retrieve the first string extracted by the XPath and strip() to remove the surrounding whitespace, resulting in just the number.
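The difference between comparing the whole attribute and testing for one class token can be reproduced without Scrapy. This is a minimal sketch using Python's standard-library xml.etree.ElementTree module on a well-formed copy of the markup; the <ul> wrapper is an assumption, and the token test is done in Python because that module has no contains() function.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<ul>
  <li class="commits">
    <a data-pjax="" href="/samthomson/flot/commits/master">
      <span class="octicon octicon-history"></span>
      <span class="num text-emphasized">
        521
      </span>
      commits
    </a>
  </li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# Exact comparison against a single class name: the attribute value is
# "num text-emphasized", so this finds nothing.
exact_single = root.findall(
    ".//li[@class='commits']//span[@class='text-emphasized']"
)

# Exact comparison against the full attribute value: this matches.
exact_full = root.findall(
    ".//li[@class='commits']//span[@class='num text-emphasized']"
)

# Token test, similar in spirit to the CSS selector span.text-emphasized.
by_token = [
    span
    for span in root.iter("span")
    if "text-emphasized" in (span.get("class") or "").split()
]

commits = by_token[0].text.strip()
print(len(exact_single), len(exact_full), commits)
```

Splitting the attribute on whitespace is stricter than a substring test, which would also accept a class such as "text-emphasized-more".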

How to write the single xpath when the text is in two lines

How to write the single xpath for this
<div class="col-lg-4 col-md-4 col-sm-4 profilesky"> <div class="career_icon">
<span> Boost </span> <br/>
Your Profile </div>
I am able to write it as two expressions using the "contains" function:
.//*[contains(text(),'Boost')]
.//*[contains(text(),'Your Profile')]
But I want to write the XPath for this as a single expression.
You can try this way :
.//*[@class='career_icon' and contains(., 'Boost') and contains(., 'Your Profile')]
The above XPath checks for an element whose class attribute equals career_icon and whose content contains both the Boost and Your Profile texts.
Note that text() only looks at direct child text nodes. To check the entire text content of an element, simply use dot (.).
You can also combine several predicates by writing them one after another, since they apply to the same element. Use . rather than text() here too, because Boost sits inside the nested <span>:
.//*[contains(.,'Boost')][contains(.,'Your Profile')]
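The distinction between an element's direct text and its full string value can be checked quickly in Python. This sketch assumes a well-formed copy of the markup (the question's snippet leaves the outer <div> unclosed) and uses the standard-library xml.etree.ElementTree module, where .text is the text before the first child element and itertext() yields all descendant text.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<div class="col-lg-4 col-md-4 col-sm-4 profilesky">
  <div class="career_icon">
    <span> Boost </span> <br/>
    Your Profile
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)
icon = root.find(".//div[@class='career_icon']")

# Text before the first child element: whitespace only, so a test on
# the first text node cannot see either phrase.
first_text = icon.text or ""

# Full string value, comparable to "." in XPath, with whitespace collapsed.
full_text = " ".join("".join(icon.itertext()).split())

has_both = "Boost" in full_text and "Your Profile" in full_text
print(repr(first_text), full_text, has_both)
```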

How to return xpath union of nodes in separate trees?

It's a basic question, but I couldn't find the answer anywhere.
<a>
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<b2>
<c2>
<d2>D2</d2>
<e2>E2</e2>
</c2>
</b2>
</a>
From the above I'd like to return:
<a>
<d1>D1</d1>
<e1>E1</e1>
<d2>D2</d2>
<e2>E2</e2>
</a>
And not:
<a>
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<b2>
<d2>D2</d2>
<e2>E2</e2>
</b2>
</a>
If that makes any sense. I tried "/a", but that gave me:
<a>
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<b2>
<c2>D2E2</c2>
</b2>
</a>
If you meant to select all leaf nodes (elements without child elements), you can try this XPath:
//*[not(*)]
Or using XPath union (|) to get child nodes of <b1> and <c2> :
(//b1/* | //c2/*)
Given sample XML you posted, both XPath above will return :
<d1>D1</d1>
<e1>E1</e1>
<d2>D2</d2>
<e2>E2</e2>
But if you really need the result to be wrapped in <a>, then I agree with @minopret's comment: that isn't what XPath is meant to do. XSLT is a more appropriate way to transform XML into a different format.
UPDATE :
In response to your last comment, there is no such grouping in XPath. It should be done in the host language if you need that data structure. Your best bet is to select the parents of the desired nodes in XPath so that you get them grouped by parent. Then you can do further processing in the host language, for example:
//*[not(*)]/parent::*
//*[*[not(*)]]
Either of the two XPath queries above returns:
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<c2>
<d2>D2</d2>
<e2>E2</e2>
</c2>
XPath can only return nodes that are already present in your source tree. To construct new nodes, or reorganise the tree, you need XSLT or XQuery.
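As a rough illustration of doing the regrouping in a host language, the following Python sketch selects the leaf elements, selects their parents, and then builds the flattened <a> document by hand. It uses the standard-library xml.etree.ElementTree module; the list comprehensions stand in for //*[not(*)] and //*[*[not(*)]], and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Compact copy of the question's XML (no whitespace between elements,
# so the serialized output below stays on one line).
SAMPLE = (
    "<a>"
    "<b1><d1>D1</d1><e1>E1</e1></b1>"
    "<b2><c2><d2>D2</d2><e2>E2</e2></c2></b2>"
    "</a>"
)

root = ET.fromstring(SAMPLE)

# Elements with no child elements, in document order.
leaves = [el for el in root.iter() if len(el) == 0]

# Elements that have at least one leaf child, i.e. the leaves' parents.
groups = [el for el in root.iter() if any(len(child) == 0 for child in el)]

# XPath cannot create the new <a> wrapper, so it is built here instead.
wrapper = ET.Element(root.tag)
wrapper.extend(leaves)
flattened = ET.tostring(wrapper, encoding="unicode")

print([el.tag for el in leaves])
print([el.tag for el in groups])
print(flattened)
```

The leaf elements appended to the wrapper are the same objects as in the parsed tree, so copy them first if the original tree must stay independent.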

Resources