Is it possible to select the first element in each row which matches a specific class? This is the HTML structure at the moment.
<ul>
<li>
<article>
<time class="published-date"></time>
<p>Text</p>
</article>
</li>
<li>
<article>
<time class="published-date"></time>
<p>Text</p>
</article>
</li>
</ul>
I was wondering what would be the best and most specific query string in terms of getting the time element with the class published-date in each row?
If a row can contain more than one time element with class="published-date", add a positional predicate (XPath positions are 1-based):
//ul/li/article/time[@class = "published-date"][1]
If there is only a single time element in every row, simply do:
//ul/li/article/time[@class = "published-date"]
Using the XPath selector...
//time[@class="published-date"]
...will select all time nodes with the class published-date.
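To make the difference between the two forms concrete, here is a minimal sketch in Python. It assumes the third-party lxml package is installed; the date values and the second matching time element in the last row are made-up sample data, not part of the original markup.

```python
from lxml import etree

# Sample markup modelled on the question. The dates are invented, and the
# second row deliberately contains two matching <time> elements.
doc = etree.fromstring("""
<ul>
  <li><article>
    <time class="published-date">2020-01-01</time>
    <p>Text</p>
  </article></li>
  <li><article>
    <time class="published-date">2020-02-02</time>
    <time class="published-date">2020-02-03</time>
    <p>Text</p>
  </article></li>
</ul>
""")

# Without a position: every matching <time> in every row.
every_match = doc.xpath('//ul/li/article/time[@class="published-date"]')

# With [1]: only the first matching <time> under each <article>.
first_per_row = doc.xpath('//ul/li/article/time[@class="published-date"][1]')

print([t.text for t in every_match])
print([t.text for t in first_per_row])
```

The [1] is applied per parent article, after the class filter, so it picks the first matching time in each row rather than the first one in the whole document.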
I would like to show an example.
This is how the page looks:
<a class="aclass">
<div class="divclass"></div>
<div id="innerclass">
<span class="spanclass">Hello</span>
</div>
</a>
<a class="aclass">
<div class="divclass"></div>
<div id="innerclass">
<span class="spanclass">Pick Delivery Location</span>
</div>
</a>
I want to select anchor tags that have a child (direct or non-direct) span that has the text 'Hello'.
Right now, I do something like this:
//a[@class='aclass'][div/span[text() = 'Hello']]
I want to be able to select without having to select direct children (div in this case), like this:
//a[@class='aclass'][//span[text() = 'Hello']]
However, the second one finds all the anchor tags with the class 'aclass' rather than the one with the span with 'Hello' text.
I hope I worded my question clearly. Please feel free to edit if necessary.
In your attempt, the // inside the predicate goes back to the root of the document. Effectively you are saying "give me the a elements for which there is a span with the text 'Hello' anywhere in the document", which is why you get them all.
What you need is the descendant axis:
//a[@class='aclass' and descendant::span[text() = 'Hello']]
Note I have joined the conditions with and, but two separate predicates would also work.
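The difference is easy to check with a small script. This is only a sketch, assuming the third-party lxml package is available; the body wrapper element and the helper name anchor_texts are mine, added so the two anchors parse as one document. It also tries the relative form .//span, which behaves like the descendant axis here.

```python
from lxml import etree

# The two anchors from the question, wrapped in a hypothetical <body> root.
doc = etree.fromstring("""
<body>
  <a class="aclass">
    <div class="divclass"></div>
    <div id="innerclass"><span class="spanclass">Hello</span></div>
  </a>
  <a class="aclass">
    <div class="divclass"></div>
    <div id="innerclass"><span class="spanclass">Pick Delivery Location</span></div>
  </a>
</body>
""")

def anchor_texts(expr):
    """Return the span text found under each <a> that expr selects."""
    return [a.findtext('.//span') for a in doc.xpath(expr)]

# Absolute path in the predicate: true for every <a>, so both are returned.
absolute = anchor_texts("//a[@class='aclass'][//span[text() = 'Hello']]")

# Descendant axis: evaluated relative to each <a>.
descendant = anchor_texts("//a[@class='aclass' and descendant::span[text() = 'Hello']]")

# Relative path with a leading dot: same result as the descendant axis here.
relative = anchor_texts("//a[@class='aclass'][.//span[text() = 'Hello']]")

print(absolute, descendant, relative)
```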
There’s a document structured as follows:
<div class="document">
<div class="title">
<AAA/>
</div class="title">
<div class="lead">
<BBB/>
</div class="lead">
<div class="photo">
<CCC/>
</div class="photo">
<div class="text">
<!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="text">
<div class="more_text">
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="more_text">
<div class="other_stuff">
<DDD/>
</div class="other_stuff">
</div class="document">
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection $ns1[count(.|$ns2) = count($ns2)] works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[@class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff">, including the <div class="photo"> element. I tried several ways of inserting a not() condition into the formula itself:
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]
(and the same with the /preceding::* part), but they don't work. It looks like not() is ignored: the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
Selecting from the <div class="photo"> element onwards, which would exclude it automatically, is not an option, because in other documents it can appear in any position or not at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and back to the beginning of the whole document. Would it be better to specify an exact end point for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn’t seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do. For this particular case, the entire XPath would look like this (formatted for readability):
//*[@class="lead"]
    /following::*[
        count(. | //*[@class="other_stuff"]/preceding::*)
        =
        count(//*[@class="other_stuff"]/preceding::*)
    ][not(self::div[@class='photo'])]
    /text()
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure whether it would be 'better'; what I can tell you is that following::[@class="other_stuff"] is an invalid expression. An axis has to be followed by a node test saying what the predicate applies to, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].
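Here is a small sketch of the intersection plus the extra predicate, assuming the third-party lxml package. The document is a trimmed, well-formed stand-in for the one in the question, and the names AFTER_LEAD, BEFORE_OTHER and label are only for illustration. One thing it shows: not(self::div[@class='photo']) drops the photo div itself, but its descendants are separate members of the following:: node-set and stay in the result. If the whole subtree should go, ancestor-or-self:: is one way to express that.

```python
from lxml import etree

# Trimmed, well-formed stand-in for the document in the question.
doc = etree.fromstring("""
<div class="document">
  <div class="title"><AAA/></div>
  <div class="lead"><BBB/></div>
  <div class="photo"><CCC/></div>
  <div class="text"><DDD><EEE/></DDD></div>
  <div class="more_text"><FFF/></div>
  <div class="other_stuff"><GGG/></div>
</div>
""")

AFTER_LEAD = '//*[@class="lead"]/following::*'
BEFORE_OTHER = '//*[@class="other_stuff"]/preceding::*'

# Kayessian intersection: keep the nodes of the first set that are also
# members of the second set.
between = f'{AFTER_LEAD}[count(. | {BEFORE_OTHER}) = count({BEFORE_OTHER})]'

def label(el):
    """Use the class attribute as a label, falling back to the tag name."""
    return el.get('class') or el.tag

everything = [label(e) for e in doc.xpath(between)]

# Predicate from the answer: removes the photo div itself.
no_photo_div = [label(e) for e in doc.xpath(
    between + "[not(self::div[@class='photo'])]")]

# Variation (my assumption about the intent): remove the photo div and
# everything inside it.
no_photo_subtree = [label(e) for e in doc.xpath(
    between + "[not(ancestor-or-self::div[@class='photo'])]")]

print(everything)
print(no_photo_div)
print(no_photo_subtree)
```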
Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each section header. However, inconveniently, the designer of the webpage decided to break up the data into three columns, adding an additional div (with classes column right etc).
My current method of extraction uses two XPath expressions.
For the section headers, I use this one (get all h1 elements within the div with the given id):
//div[@id="relevantID"]//h1
This returns a list of h1 elements. Looping over them, I apply an additional selector to each matched h1: look up the ul nodes that follow it and retrieve all their li nodes.
following-sibling::ul//li
But thanks to the designer's aesthetics, I am failing in the one particular case I've marked in the HTML file, where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the HTML to make a selector match is considered good practice (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not necessary, hints/tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axes if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
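A quick way to try this out is sketched below, assuming the third-party lxml package and a shortened version of the HTML from the question. The second half is my own variation, not part of the answer. When a node-set is converted to a string, XPath 1.0 uses its first node in document order, so preceding::h1 always resolves to the first header on the page. To look up the header each item belongs to, preceding::h1[1] picks the nearest one instead, because positions on a reverse axis count backwards from the context node.

```python
from lxml import etree

# Shortened version of the HTML from the question.
doc = etree.fromstring("""
<div id="relevantID">
  <div class="column left">
    <h1> Section-Header-1 </h1>
    <ul><li>item1a</li><li>item1b</li></ul>
  </div>
  <div class="column">
    <ul><li>item1c</li></ul>
    <h1> Section-Header-2 </h1>
    <ul><li>item2a</li><li>item2b</li></ul>
  </div>
  <div class="column right">
    <h1> Section-Header-3 </h1>
    <ul><li>item3a</li></ul>
  </div>
</div>
""")

# The expression from the answer: items between the first and second header.
between_1_and_2 = [li.text for li in doc.xpath(
    "//div[@id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'"
    " and normalize-space(following::h1) = 'Section-Header-2']"
)]

# Variation (an assumption, not from the answer): group every li under its
# nearest preceding h1, regardless of which column it sits in.
sections = {}
for li in doc.xpath("//div[@id='relevantID']//li"):
    header = str(li.xpath("normalize-space(preceding::h1[1])"))
    sections.setdefault(header, []).append(li.text)

print(between_1_and_2)
print(sections)
```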
You can combine following-sibling and preceding-sibling with the union operator | to also pick up li elements that sit before the h1 in the same div. As an example, for the second h1:
((//div[@id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[@id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 using //div[@id="relevantID"]//h1 and, as a second step, retrieving all li items for each h1 using following-sibling::ul//li, you could combine this into following-sibling::ul//li | preceding-sibling::ul//li.
I have the following code sample in an xmlns root:
<ol class="stan">
<li>Item one.</li>
<li>
<p>Paragraph one.</p>
<p>Paragraph two.</p>
</li>
<li>
<pre>Preformated one.</pre>
<p>Paragraph one.</p>
</li>
</ol>
I would like to perform a different operation on the first item in <li> depending on the type of tag it resides in, or no tag, i.e. the first <li> in the sample.
EDIT:
My logic in pursuing the task turns out to be incorrect.
How do I query a <li> that has no descendants as in the first list item?
I tried negation:
@doc.xpath("//xmlns:ol[@class='stan']//xmlns:li/xmlns:*[1][not(p|pre)]")
That gives me the exact opposite of what I think I am asking for.
I think I am making the expression more complicated since I can't find the right solution.
UPDATE:
Navin Rawat has answered this one in the comments. The correct code would be:
@doc.xpath("//xmlns:ol[@class='stan']/xmlns:li[not(xmlns:*)]")
CORRECTION:
The correct question involves both an XPath search and a Nokogiri method.
Given the above xhtml code, how do I search for first descendant using xpath? And how do I use xpath in a conditional statement, e.g.:
@doc.xpath("//xmlns:ol[@class='stan']/xmlns:li").each do |e|
  if e.xpath("e has no descendants")
    perform task
  elsif e.xpath("e first descendant is <p>")
    perform second task
  elsif e.xpath("e first descendant is <pre>")
    perform third task
  end
end
I am not asking for complete code, just the parts in parentheses in the above Nokogiri code.
Pure XPath answer...
If you have the following XML:
<ol class="stan">
<li>Item one.</li>
<li>
<p>Paragraph one.</p>
<p>Paragraph two.</p>
</li>
<li>
<pre>Preformated one.</pre>
<p>Paragraph one.</p>
</li>
</ol>
and want to select the <li> that has no child elements, as in the first list item, use:
//ol/li[count(*)=0]
If you have a namespace problem, please post the whole XML (with the root element and namespace declarations) so that we can help you deal with it.
EDIT: after our discussion, here is your final tested code :)
@doc.xpath("//xmlns:ol[@class='footnotes']/xmlns:li").each do |e|
  if e.xpath("count(*)=0")
    puts "No children"
  elsif e.xpath("count(*[1]/self::xmlns:p)=1")
    puts "First child is <p>"
  elsif e.xpath("count(*[1]/self::xmlns:pre)=1")
    puts "First child is <pre>"
  end
end
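For readers who are not on Ruby, the same three checks can be sketched in Python. This assumes the third-party lxml package. The XHTML namespace URI, the prefix x and the function name classify are my assumptions, since the question never shows its root element. As in the Nokogiri code above, a boolean XPath expression comes back as a plain true/false value that can be used directly in a condition.

```python
from lxml import etree

# Assumption: the document uses the XHTML namespace. The prefix "x" is arbitrary.
NS = {"x": "http://www.w3.org/1999/xhtml"}

doc = etree.fromstring("""
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<ol class="stan">
  <li>Item one.</li>
  <li><p>Paragraph one.</p><p>Paragraph two.</p></li>
  <li><pre>Preformatted one.</pre><p>Paragraph one.</p></li>
</ol>
</body></html>
""")

def classify(li):
    """Describe a list item by its first child element, if it has one."""
    if li.xpath("count(*) = 0"):
        return "no children"
    if li.xpath("boolean(*[1]/self::x:p)", namespaces=NS):
        return "first child is <p>"
    if li.xpath("boolean(*[1]/self::x:pre)", namespaces=NS):
        return "first child is <pre>"
    return "something else"

kinds = [classify(li)
         for li in doc.xpath("//x:ol[@class='stan']/x:li", namespaces=NS)]
print(kinds)
```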
I have an xml structure that looks like this:
<document>
<body>
<section>
<title>something</title>
<subtitle>Something again</subtitle>
<section>
<p xml:id="1234">Some text</p>
<figure id="2121"></figure>
<section>
<p xml:id="somethingagain">Some text</p>
<figure id="939393"></figure>
<p xml:id="countelement"></p>
</section>
</section>
</section>
<section>
<title>something2</title>
<subtitle>Something again2</subtitle>
<section>
<p xml:id="12345678">Some text2</p>
<figure id="939394"></figure>
<p xml:id="countelement2"></p>
</section>
</section>
</body>
</document>
How can I count the figure elements I have before the <p xml:id="countelement"></p> element using XPath?
Edit:
I only want to count figure elements within the parent section; in the next section it should start from 0 again.
Assuming you're using an XPath 2.0 compatible engine, find the count elements and call fn:count() for each of them, with the preceding figure elements as input.
This will return the number of figures preceding each "countelement" on the same level (I guess this is what you actually want):
//p[@xml:id="countelement"]/count(preceding-sibling::figure)
This will return the number of figures preceding each "countelement" on the same level and on the level above:
//p[@xml:id="countelement"]/count(preceding-sibling::figure | parent::*/preceding-sibling::figure)
This will return the number of all figures preceding each "countelement" in document order, at any level:
//p[@xml:id="countelement"]/count(preceding::figure)
If you're bound to XPath 1.0, you won't be able to get multiple results. If @xml:id really is an ID (and thus unique), you will be able to use this query:
count(//p[@xml:id="countelement"]/preceding::figure)
If there are "countelements" which are not <p/> elements, replace p by *.
count(id("countelement")/preceding-sibling::figure)
Please note that the xml:id attributes of two different elements cannot have the same value, such as "countelement". If you wish two different elements to have a same-named attribute with the same value "countelement", it must be some other attribute, perhaps "kind", that is not of DTD attribute type ID. In that case, in place of id("countelement") you would use *[@kind="countelement"].
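With an XPath 1.0 engine the per-element counts can be done by looping in the host language. Here is a sketch, assuming the third-party lxml package (which implements XPath 1.0). The document is a cut-down version of the one in the question, and the xml:id values p1, p2 and p3 are placeholders of mine.

```python
from lxml import etree

# Cut-down version of the document in the question.
doc = etree.fromstring("""
<body>
  <section>
    <section>
      <p xml:id="p1">Some text</p>
      <figure id="2121"/>
      <section>
        <p xml:id="p2">Some text</p>
        <figure id="939393"/>
        <p xml:id="countelement"/>
      </section>
    </section>
  </section>
  <section>
    <section>
      <p xml:id="p3">Some text2</p>
      <figure id="939394"/>
      <p xml:id="countelement2"/>
    </section>
  </section>
</body>
""")

target = '//p[@xml:id="countelement"]'

# Figures before the marker inside the same parent section only.
same_level = doc.xpath(f'count({target}/preceding-sibling::figure)')

# All figures before the marker anywhere in the document.
whole_document = doc.xpath(f'count({target}/preceding::figure)')

# XPath 1.0 yields one number per expression, so count per marker in a loop.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'
per_marker = {
    p.get(XML_ID): int(p.xpath('count(preceding-sibling::figure)'))
    for p in doc.xpath('//p[starts-with(@xml:id, "countelement")]')
}

print(same_level, whole_document, per_marker)
```

Since the count should restart in each section, preceding-sibling:: is the axis that matches the edit in the question; preceding:: keeps counting across sections.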