How to select the first occurrence in each element by XPath?

In the following html tags:
<div>
<div>
<h3>
<a href='http://Ali.org'></a>
</h3>
<div>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
<div>
<h4>
<a href='http://Ali.org'></a>
</h4>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
I want to select the two 'a' tags whose href is 'http://Ali.org', that is, the first 'a' inside each inner div. The following works:
//div//a[not(parent::*[not(following-sibling::*)])]
But is there a simpler XPath?
With the following, all of the 'a' tags are selected, since each one is the first 'a' child of its parent:
//div/div//a[1]
With the following, only the very first 'a' tag in the document is selected:
(//div//a)[1]
I want to select the 'a' tags that are the first among the 'a' tags of each div element...

// in the middle of a path is an abbreviation for /descendant-or-self::node()/, so if you write
//div/div//a[1]
this effectively means
//div/div/descendant-or-self::node()/a[1]
This picks every a that is the first a child of its parent, for all descendant nodes. What you want is:
//div/div/descendant::a[1]
which picks the first descendant a of each matched div.
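The difference between the two forms can be checked with a small Python sketch. It rests on some assumptions: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no descendant:: axis), so the "first descendant" step is emulated with iter(), and the variable names are illustrative. It is also anchored at the root element, which corresponds to /div/div/descendant::a[1]. In the sample markup the first inner div itself contains a div, so //div/div would match that nested div as well.

```python
import xml.etree.ElementTree as ET

# The markup from the question, parsed as XML.
root = ET.fromstring("""<div>
<div>
<h3><a href='http://Ali.org'></a></h3>
<div><p><a href='http://Mohammad.org'></a></p></div>
</div>
<div>
<h4><a href='http://Ali.org'></a></h4>
<p><a href='http://Mohammad.org'></a></p>
</div>
</div>""")

# .//a[1]: every <a> that is the first <a> child of its own parent.
first_child = [a.get("href") for a in root.findall(".//a[1]")]

# Emulation of /div/div/descendant::a[1]: for each inner div,
# take the first <a> descendant in document order.
first_descendant = []
for div in root.findall("div"):
    a = next(div.iter("a"), None)
    if a is not None:
        first_descendant.append(a.get("href"))

print(first_child)       # all four links
print(first_descendant)  # only the two Ali.org links
```

With a full XPath 1.0 engine the descendant::a[1] expression can be used directly and the loop is unnecessary.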

Related

Select nodes that 1) precede a given node but 2) are also descendants of another given node

Say I have the following XML:
<body>
<div id="global-header">
header
</div>
<div class="a">
<h3>some title</h3>
<p>text 1</p>
<p>text 2</p>
<p>text 3</p>
</div>
</body>
I want to:
1. find any <p> node whose value is "text 2", and then
2. find all the nodes that precede this particular <p> but are also descendants of the <div class='a'> node.
The desired output should look like:
<h3>some title</h3>
<p>text 1</p>
The caveat is that the preceding nodes may be of arbitrary type, not only <h3> and <p> as in the case above.
My first try:
.//p[text()="text 2"]/preceding::*
Unfortunately, this will also select <div id="global-header">, which is not desired.
Use preceding-sibling instead of preceding. It selects only nodes that share a parent with the matched <p>, and here those are all children of <div class="a">:
.//p[text()="text 2"]/preceding-sibling::*

XPATH: Select a node whose children do not contain some text

I'm trying to select a node whose children do not contain some specific text.
For example:
<div class="b-margin">
<div class="tag">Pt</div>
<div class="tag">En</div>
</div>
<div class="b-margin">
<div class="tag">Ru</div>
<div class="tag">En</div>
</div>
How would I go about selecting the 'div class="b-margin"' nodes that do not have children with the text "Pt"?
Here is a simple XPath:
//div[@class='b-margin' and not(div[.='Pt'])]
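A minimal Python sketch of the same filter follows. It assumes the standard library's xml.etree.ElementTree, whose XPath subset has no not() function, so the negation is done in Python. The wrapping <root> element is added only so that the two sample divs form a single document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<root>
<div class="b-margin">
<div class="tag">Pt</div>
<div class="tag">En</div>
</div>
<div class="b-margin">
<div class="tag">Ru</div>
<div class="tag">En</div>
</div>
</root>""")

# Keep each b-margin div that has no child div whose text is exactly "Pt".
matches = [
    d for d in doc.findall(".//div[@class='b-margin']")
    if d.find("div[.='Pt']") is None
]

tags = [[child.text for child in d] for d in matches]
print(tags)
```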

Xpath: Finding node next to node if present

I'm trying to scrape a site with a highly varying HTML structure. The information of interest is not encapsulated. The only marker is a span with a target id TARGETID.
The structure is:
<h2>
<span class="TARGETID">TARGETID</span>
</h2>
<p> <!-- this is not always present, could be more p tags --> </p>
<ul> <!-- also not always present, if there, this is what we want --> </ul>
<h2>
<span class="SOMEIRRELEVANTID">IRRELEVANT</span>
</h2>
My approach was:
//h2/span[contains(text(), 'TARGETID')]/../following-sibling::ul[1][count(li) > 1][li]//a/text()
This succeeds when an unordered list is present after the TARGETID, but if not, it takes the next unordered list it finds (which makes sense based on the query).
My question is: How can I limit the query to the nodes of two H2's, starting with the one containing a span with the target id and limited by any following H2 with a span of a different id?
Any hints are greatly appreciated.
This XPath,
//ul[preceding::h2[1][.='TARGETID']]//a
will select all a elements beneath a ul that occurs after an h2 with a string value of "TARGETID" but before any other h2 elements.
So, for this expanded example,
<div>
<h2>
<span class="TARGETID">TARGETID</span>
</h2>
<p> <!-- this is not always present, could be more p tags --> </p>
<ul> link1 </ul>
<h2>
<span class="SOMEIRRELEVANTID">IRRELEVANT</span>
</h2>
<ul> link2 </ul>
<h2>
<span class="SOMEIRRELEVANTID">IRRELEVANT</span>
</h2>
</div>
it would select only
link1
and not link2, as requested.
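The idea behind preceding::h2[1] (look only at the nearest h2 before each ul) can be sketched in Python. This rests on several assumptions: it uses the standard library's xml.etree.ElementTree, which lacks the preceding axis, so the nearest heading is tracked by hand; it assumes the h2 and ul elements are siblings inside one container, as in the example; the function name links_under is hypothetical; and the sample uls hold real <a> elements where the example only shows placeholder text.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<div>
<h2><span class="TARGETID">TARGETID</span></h2>
<p>optional paragraph</p>
<ul><li><a>link1</a></li></ul>
<h2><span class="SOMEIRRELEVANTID">IRRELEVANT</span></h2>
<ul><li><a>link2</a></li></ul>
</div>""")

def links_under(container, target):
    """Collect the text of <a> elements in each <ul> whose nearest
    preceding sibling <h2> has the string value `target`."""
    found = []
    current_heading = None
    for child in container:
        if child.tag == "h2":
            # String value of the heading, with surrounding whitespace removed.
            current_heading = "".join(child.itertext()).strip()
        elif child.tag == "ul" and current_heading == target:
            found.extend(a.text for a in child.iter("a"))
    return found

links = links_under(doc, "TARGETID")
print(links)
```

If there is no ul between the target h2 and the next h2, the result is simply empty, which is the behaviour the question asks for.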

nokogiri + mechanize css selector by text

I am new to Nokogiri and so far most familiar with CSS selectors. I am trying to parse information from a table; below is a sample of the table and the code I'm using. I'm stuck on the appropriate if statement, as it seems to return the whole contents of the table.
Table:
<div class="holder">
<div class ="row">
<div class="c1">
<!-- Content I Don't need -->
</div>
<div class="c2">
<span class="data">
<!-- Content I Don't Need -->
<span class="data">
</div>
</div>
...
<div class="row">
<div class="c1">
SPECIFIC TEXT
</div>
<div class="c2">
<span class="data">
What I want
</span>
</div>
</div>
</div>
My Script: (if SPECIFIC TEXT is found in the table it returns every "div.c2 span.data" variable - so I've either screwed up my knowledge of do loops or if statements)
data = []
page.agent.get(url)
page.search('div.row').each do |row_data|
if row_data.search('div.c1:contains("/SPECIFIC TEXT/")').text.strip
temp = row_data.search('div.c2 span.data').text.strip
data << temp
end
end
There's no need to stop and insert Ruby logic when you can extract what you need with a single CSS selector.
data = page.search('div.row > div.c1:contains("SPECIFIC TEXT") + div.c2 span.data')
This will include only the elements that match the selector (i.e. those that follow the SPECIFIC TEXT).
Here's where your logic may have gone wrong:
This code
if (row_data.search('div.c1:contains("SPECIFIC TEXT")'...
temp = row_data.search('div.c2 span.data')...
first searches the row for the specific text and then runs a second, independent query from the same starting point. Note that .text.strip returns a String, and in Ruby every String (even an empty one) is truthy, so the if condition passes for every row and each row's div.c2 span.data gets collected. The key is the + in the CSS selector above, which returns the element immediately following (i.e. the next sibling element). I'm making an assumption, of course, that the next element is always what you want.
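For readers who want to see what the adjacent-sibling combinator (+) pairs up, here is a small sketch in Python rather than Ruby. It is an emulation under assumptions, not Nokogiri's implementation: it uses the standard library's xml.etree.ElementTree, which has no CSS selector support, so neighbouring children are compared by hand, and the sample markup is a tidied-up version of the table above.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<div class="holder">
<div class="row">
<div class="c1">other</div>
<div class="c2"><span class="data">not wanted</span></div>
</div>
<div class="row">
<div class="c1"> SPECIFIC TEXT </div>
<div class="c2"><span class="data"> What I want </span></div>
</div>
</div>""")

wanted = []
for row in doc.findall("div[@class='row']"):
    children = list(row)
    # Walk over (element, next sibling) pairs, which is what "+" expresses.
    for prev, nxt in zip(children, children[1:]):
        is_marker = (prev.get("class") == "c1"
                     and "SPECIFIC TEXT" in "".join(prev.itertext()))
        if is_marker and nxt.get("class") == "c2":
            wanted.extend(
                span.text.strip()
                for span in nxt.findall(".//span[@class='data']")
            )

print(wanted)
```

Only the c2 cell that directly follows the matching c1 cell contributes to the result; the first row is skipped.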
I'd do
require 'nokogiri'
html = <<_
<div class="holder">
<div class ="row">
<div class="c1">
<!-- Content I Don't need -->
</div>
<div class="c2">
<span class="data">
<!-- Content I Don't Need -->
<span class="data">
</div>
</div>
<div class="row">
<div class="c1">
SPECIFIC TEXT
</div>
<div class="c2">
<span class="data">
What I want
</span>
</div>
</div>
</div>
_
doc = Nokogiri::HTML(html)
css_string = 'div.row > div.c1[text()*="SPECIFIC TEXT"] + div.c2 span.data'
doc.at(css_string).text.strip
# => "What I want"
How those selectors work here:
[name*="value"] - Selects elements that have the specified attribute with a value containing a given substring.
Child Selector (“parent > child”) - Selects all direct child elements specified by "child" of elements specified by "parent".
Next Adjacent Selector (“prev + next”) - Selects all next elements matching "next" that are immediately preceded by a sibling "prev".
Class Selector (“.class”) - Selects all elements with the given class.
Descendant Selector (“ancestor descendant”) - Selects all elements that are descendants of a given ancestor.

xPath strange behaviour - selecting ALL elements even if [1] set

Today I stumbled upon a very interesting case (at least for me). I am messing around with Selenium and XPath and tried to get some elements, but got some strange behaviour:
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some other text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some even unrelated text
</a>
</div>
</div>
</div>
This is my data.
When I run the following XPath query:
//div[@class="title"][1]/a
I get ALL of them as a result instead of only the first one. But if I query:
//div[@class="resultcontainer"][1]/div[@class="info"]/div[@class="title"]/a
I get only the first, not all.
Is there some divine reason behind that?
Best regards,
bisko
I think you want
(//div[@class="title"])[1]/a
This:
//div[@class="title"][1]/a
selects all (<a> elements that are children of) <div> elements that have a @class of 'title' and are the first such child of their parent (in this context). Which means: it selects all of them.
The working XPath selects all <div> elements that have a @class of 'title' - and of those it takes the first one.
The predicates (the expressions in square brackets []) are applied per context node of the preceding location step (i.e. "//div"), not to the combined result. To apply a predicate to a filtered set of nodes, you need to make the grouping clear with parentheses.
Consequently, this:
//div[1][@class="title"]/a
would select every <div> that is the first <div> child of its parent and then filter those down further by checking the @class value. Also not what you want. ;-)
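The per-parent versus whole-result distinction can be illustrated with a short Python sketch. It rests on assumptions: it uses the standard library's xml.etree.ElementTree, which cannot evaluate parenthesised expressions such as (...)[1], so both readings are emulated in Python, and the <results> wrapper is added only so that the three sample divs form one document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<results>
<div class="resultcontainer"><div class="info"><div class="title">
<a>some text</a>
</div></div></div>
<div class="resultcontainer"><div class="info"><div class="title">
<a>some other text</a>
</div></div></div>
<div class="resultcontainer"><div class="info"><div class="title">
<a>some even unrelated text</a>
</div></div></div>
</results>""")

def is_title(element):
    return element.tag == "div" and element.get("class") == "title"

# Reading of //div[@class="title"][1]/a:
# the position is counted separately inside every parent.
per_parent = []
for parent in doc.iter():
    titles = [child for child in parent if is_title(child)]
    if titles:
        per_parent.extend(a.text.strip() for a in titles[0].findall("a"))

# Reading of (//div[@class="title"])[1]/a:
# collect all matches first, then take the first one in document order.
first_title = doc.find(".//div[@class='title']")
whole_result = [a.text.strip() for a in first_title.findall("a")]

print(per_parent)    # three links, one per parent
print(whole_result)  # a single link
```

Each title div is the first (and only) title div inside its own info div, which is why the unparenthesised form returns all three.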
