I'm trying to scrape a site with a highly variable HTML structure. The information of interest is not encapsulated. The only marker is a span with a target id, TARGETID.
Structure is:
<h2>
<span class="TARGETID">TARGETID</span>
</h2>
<p> <!-- this is not always present, could be more p tags --> </p>
<ul> <!-- also not always present, if there, this is what we want --> </ul>
<h2>
<span class="SOMEIRRELEVANTID">IRRELEVANT</span>
</h2>
My approach was:
//h2/span[contains(text(), 'TARGETID')]/../following-sibling::ul[1][count(li) > 1][li]//a/text()
This succeeds when an unordered list is present after the TARGETID, but if not, it takes the next unordered list it finds (which makes sense based on the query).
My question is: how can I limit the query to the nodes between two H2s, starting with the one containing a span with the target id and ending at any following H2 with a span of a different id?
Any hints are greatly appreciated.
This XPath,
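//ul[preceding::h2[1][normalize-space(.)='TARGETID']]//a
will select all a elements beneath a ul that occurs after an h2 whose string value is "TARGETID" but before any other h2 elements. (normalize-space() is there because the h2 in the sample has line breaks around the span, so a plain .='TARGETID' comparison would not match.)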
So, for this expanded example,
<div>
<h2>
<span class="TARGETID">TARGETID</span>
</h2>
<p> <!-- this is not always present, could be more p tags --> </p>
<ul> link1 </ul>
<h2>
<span class="SOMEIRRELEVANTID">IRRELEVANT</span>
</h2>
<ul> link2 </ul>
<h2>
<span class="SOMEIRRELEVANTID">IRRELEVANT</span>
</h2>
</div>
it would select only
link1
and not link2, as requested.
Related
From the following XML-document, I'm trying to specify XPath that will capture the text that immediately follows the h4-headline "Source", namely - in this example - "Information about the source":
<div class="doc-inf doc-inf-information">
<h3>Document information</h3>
<div>
<h4>Source</h4>
<ul>
<li>Information about the source</li>
</ul>
I've tried the following:
//h4[contains(text(), "Source")]/ul/li
Which doesn't seem to work. Would anyone be able to help? I would greatly appreciate it.
EDIT:
My problem (which I didn't specify fully, sorry) is that this div tag has multiple h4 tags in it of which I want to select the ul-child for each:
<div class="doc-inf doc-inf-information">
<h3>Document information</h3>
<div>
<h4>Source</h4>
<ul>
<li>Source information</li>
</ul>
<h4>Language</h4>
<ul>
<li>Swedish</li>
</ul>
<h4>Publishers</h4>
<ul>
<li>Publishing Project</li>
</ul>
<h4>Record ID</h4>
<ul>
<li>36785</li>
</ul>
In essence, I'm trying to grab the child under h4 headlines "Source", "Language", "Publishers", "Record ID" (= what I'm interested in is "Source information", "Swedish", "Publishing Project" and "36785") but the h4 headlines are inconsistently placed across pages so I need to be able to target the children of the specific headlines.
Your expression looks for a <ul> inside the <h4>, but the <h4> has no element children (the <ul> is its sibling), so the following doesn't work:
//h4[contains(text(), "Source")]/ul/li
Try this instead:
//div[h4[contains(text(), "Source")]]/ul/li/text()
This searches for a <div> that has an <h4> child containing the text 'Source', then selects the text of every <li> in that div's <ul> children.
In the following html tags:
<div>
<div>
<h3>
<a href='http://Ali.org'></a>
</h3>
<div>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
<div>
<h4>
<a href='http://YaALi.org'></a>
</h4>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
I want to select the two 'a' tags 'http://Ali.org' and 'http://YaALi.org'. The following works:
//div//a[not(parent::*[not(following-sibling::*)])]
But is there a simpler XPath?
The following selects all of the 'a' tags, since each one is the first 'a' child of its parent:
//div/div//a[1]
And the following selects just the first 'a' tag in the document:
(//div//a)[1]
I want to select the 'a' tags that are the first 'a' descendant of each div element...
// in the middle of a path is an abbreviation for /descendant-or-self::node()/, so if you do
//div/div//a[1]
this effectively means
//div/div/descendant-or-self::node()/a[1]
This picks the first a child of every node at or below those divs. What you want is:
//div/div/descendant::a[1]
which will pick the first descendant a.
I am new to Nokogiri and so far most familiar with CSS selectors. I am trying to parse information from a table; below are a sample of the table and the code I'm using. I'm stuck on the appropriate if statement, as it seems to return the whole contents of the table.
Table:
<div class="holder">
<div class ="row">
<div class="c1">
<!-- Content I Don't need -->
</div>
<div class="c2">
<span class="data">
<!-- Content I Don't Need -->
</span>
</div>
</div>
...
<div class="row">
<div class="c1">
SPECIFIC TEXT
</div>
<div class="c2">
<span class="data">
What I want
</span>
</div>
</div>
</div>
My Script: (if SPECIFIC TEXT is found in the table it returns every "div.c2 span.data" variable - so I've either screwed up my knowledge of do loops or if statements)
data = []
page.agent.get(url)
page.search('div.row').each do |row_data|
if (row_data.search('div.c1:contains("/SPECIFIC TEXT/")').text.strip
temp = row_data.search('div.c2 span.data').text.strip
data << temp
end
end
There's no need to stop and insert ruby logic when you can extract what you need in a single CSS selector.
data = page.search('div.row > div.c1:contains("SPECIFIC TEXT") + div.c2 span.data')
This will include only the elements that match the selector (i.e. those that follow SPECIFIC TEXT).
Here's where your logic may have gone wrong:
This code
if (row_data.search('div.c1:contains("SPECIFIC TEXT")'...
temp = row_data.search('div.c2 span.data')...
first searches the row for the specific text, then if it matches, returns ALL rows matching the second query, which has the same starting point. The key is the + in the CSS selector above, which returns the element immediately following (i.e. the next sibling element). I'm making an assumption, of course, that the next element is always what you want.
I'd do
require 'nokogiri'
html = <<_
<div class="holder">
<div class ="row">
<div class="c1">
<!-- Content I Don't need -->
</div>
<div class="c2">
<span class="data">
<!-- Content I Don't Need -->
</span>
</div>
</div>
<div class="row">
<div class="c1">
SPECIFIC TEXT
</div>
<div class="c2">
<span class="data">
What I want
</span>
</div>
</div>
</div>
_
doc = Nokogiri::HTML(html)
css_string = 'div.row > div.c1[text()*="SPECIFIC TEXT"] + div.c2 span.data'
doc.at(css_string).text.strip
# => "What I want"
How those selectors would work here -
[name*="value"] - Selects elements that have the specified attribute with a value containing a given substring.
Child Selector (“parent > child”) - Selects all direct child elements specified by "child" of elements specified by "parent".
Next Adjacent Selector (“prev + next”) - Selects all next elements matching "next" that are immediately preceded by a sibling "prev".
Class Selector (“.class”) - Selects all elements with the given class.
Descendant Selector (“ancestor descendant”) - Selects all elements that are descendants of a given ancestor.
I am trying to get the error message off of a page from a site. The list contains several possible errors, so I can't check by id, but I do know that the one with display:list-item is the one I want. This is my rule, but it doesn't seem to work. What is wrong with it? What I want returned is the error text in the element.
//*[@id='errors']/ul/li[contains(@style,'display:list-item')]
Example dom elements:
<div id="errors" class="some class" style="display: block;">
<div class="some other class"></div>
<div class="some other class 2">
<span class="displayError">Please correct the errors listed in red below:</span>
<ul>
<li style="display:none;" id="invalidId">Enter a valid id</li>
<li style="display:list-item;" id="genericError">Something bad happened</li>
<li style="display:none;" id="somethingBlah" ............ </li>
....
</ul>
</div>
The correct XPath should be:
//*[@id='errors']//ul/li[contains(@style,'display:list-item')]
After //*[@id='errors'] you need an extra /, because <ul> is not directly beneath it. Using // again scans all underlying elements for <ul>.
If you can avoid //, for example by spelling out the intermediate steps, the query will be faster and use fewer resources.
I want to select all the LI elements which contain a SPAN with id="liveDeal152_dealPrice" as a descendant. How do I do this with XPath?
Here is a sample html
<ul>
<li id="liveDeal_152">
<p class="price">
<em>|
<span class="WebRupee">₹ </span>
<span id="liveDeal152_dealPrice">495 </span>
</p>
</li>
<li id="liveDeal_152">
<p class="price">
<em>|
<span class="WebRupee">₹ </span>
(price hidden)
</p>
</li>
</ul>
//li[.//span[@id = 'liveDeal152_dealPrice']] should do. Or more verbose but closer to your textual description //li[descendant::span[@id = 'liveDeal152_dealPrice']].
Use this
//li[.//span[@id="liveDeal152_dealPrice"]]
It selects
ALL <li> ELEMENTS
//li[ ]
THAT HAVE A <span> DESCENDANT
.//span[ ]
WITH id ATTRIBUTE EQUAL TO "liveDeal152_dealPrice"
@id="liveDeal152_dealPrice"
That said, it doesn't seem like a very wise way to select the element, mostly because the id looks dynamically generated. If you're going to use it once, it's probably OK, but if you're using it, say, for testing and will reuse it many times, it might cause trouble. Are you sure this won't change when you change your website and/or database?
As a side note:
ul stands for "unordered list"
ol stands for "ordered list"
li stands for "list item"