XPath: select h2 parent with no a child

I have the following html:
<div class="stack">
<h2 class="overflow">
<img src="http:..">
text
</h2>
<div class="sublist">
<table>
...
</table>
</div>
<h2 class="overflow">
<a href="...">link</a>
</h2>
</div>
As you can see, the .sublist div always follows an h2 with an img and some text; it's as if the div is a sublist of the h2 (the h2 is the title of the sublist). The other h2 contains an anchor tag.
I'd like to get all the h2 tags which precede the div .sublist.
This is my current XPath expression:
//div[@class="stack"]/h2/*[not(descendant::a)]
And I end up getting different elements (a, div, img) but not the h2 elements.

You said you'd like to get all the h2 tags which precede the div .sublist.
How about:
//div[@class="sublist"]/preceding-sibling::h2

Try preceding-sibling:
//div[@class="stack"]/div[@class="sublist"]/preceding-sibling::*
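A minimal sketch of what the preceding-sibling axis selects, in Python. It assumes the standard-library xml.etree.ElementTree, whose XPath subset does not implement preceding-sibling, so the helper preceding_h2 (a hypothetical name) walks the children of div.stack by hand. With a full XPath 1.0 engine, //div[@class="sublist"]/preceding-sibling::h2 should do the same job. The markup is a well-formed copy of the question's sample, and the bare <a> in the second h2 is an assumption based on the question's description.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample markup (img self-closed; the <a> is assumed).
HTML = """
<div class="stack">
  <h2 class="overflow"><img src="http:.."/>text</h2>
  <div class="sublist"><table></table></div>
  <h2 class="overflow"><a>link</a></h2>
</div>
"""


def preceding_h2(stack):
    """Emulate div[@class='sublist']/preceding-sibling::h2 for one parent element."""
    found = []
    children = list(stack)
    for index, child in enumerate(children):
        if child.tag == "div" and child.get("class") == "sublist":
            # Collect every h2 that appears before this div under the same parent.
            for sibling in children[:index]:
                if sibling.tag == "h2" and sibling not in found:
                    found.append(sibling)
    return found


stack = ET.fromstring(HTML)
titles = preceding_h2(stack)
texts = ["".join(h2.itertext()).strip() for h2 in titles]
print(texts)
```

Only the first h2 is returned here, because the h2 holding the link comes after the .sublist div rather than before it.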

Related

Select element based on cousin value

Lets say I have this html (ignore tags names):
<div>
<card>
<h2>1</h2>
</card>
<footer>
<p>text 1</p>
</footer>
</div>
<div>
<card>
<h2>2</h2>
</card>
<footer>
<p>text 2</p>
</footer>
</div>
<div>
<card>
<h2>3</h2>
</card>
<footer>
<p>text 2</p>
</footer>
</div>
and I want to select the p tag whose cousin h2 has a value of 2 (that is, the p with text 2).
If I use the expression //h2[text()="2"]/../following::footer/p I get 2 p tags.
How do I select only the p tag with a cousin h2 value of 2?
EDIT: Robbie Averill's answer was the first to work, but you should check the other answers; they are very good too.
You can navigate from the h2 matched up to the div that contains the element you want, then target footer/p elements from there:
//h2[text()="2"]/../../footer/p
Try the XPath below to select the required element:
//card[h2="2"]/following-sibling::footer/p
This XPath,
//div[card/h2="2"]/footer/p
will select footer/p cousins of card/h2 elements with string values of 2.
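As a rough check of the "go up to the shared div, then down to footer/p" idea, here is a small Python sketch using the standard-library xml.etree.ElementTree. Its XPath subset is limited, so the path is written as .//card[h2='2']/../footer/p, which follows the same route as the answers above. The <root> wrapper is an assumption added only so that the three divs parse as one document.

```python
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a <root> element (the wrapper is an assumption).
HTML = """
<root>
  <div><card><h2>1</h2></card><footer><p>text 1</p></footer></div>
  <div><card><h2>2</h2></card><footer><p>text 2</p></footer></div>
  <div><card><h2>3</h2></card><footer><p>text 2</p></footer></div>
</root>
"""

root = ET.fromstring(HTML)

# Find the card whose h2 child is "2", step up to its div, then down to footer/p.
cousins = root.findall(".//card[h2='2']/../footer/p")
cousin_texts = [p.text for p in cousins]
print(cousin_texts)
```

Because the path never leaves the div that contains the matching card, the footer of the third div is not picked up even though its p also reads "text 2".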

How to select the first occurrence in each element by XPath?

In the following html tags:
<div>
<div>
<h3>
<a href='http://Ali.org'></a>
</h3>
<div>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
<div>
<h4>
<a href='http://Ali.org'></a>
</h4>
<p>
<a href='http://Mohammad.org'></a>
</p>
</div>
</div>
I want to select the two 'a' tags with href 'http://Ali.org' (the first 'a' in each inner div). I can do that with the following:
//div//a[not(parent::*[not(following-sibling::*)])]
But what about a simpler XPath?
By the following, all of 'a' tags will be selected since they are all the first child of their parents:
//div/div//a[1]
Or by the following, just the first 'a' tag will be selected:
(//div//a)[1]
I want to select 'a' tags that are the first in the 'a' tags of div elements...
// in the middle of a path is an abbreviation for /descendant-or-self::node()/, so if you write
//div/div//a[1]
this effectively means
//div/div/descendant-or-self::node()/a[1]
For every descendant node, this picks its first child a. What you want is:
//div/div/descendant::a[1]
which will pick the first descendant a.
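The difference between the two forms can be sketched in Python with the standard-library xml.etree.ElementTree. That module has no descendant axis, so the second half emulates descendant::a[1] with Element.iter(), which walks in document order. Anchoring on the direct child divs of the outer div is an assumption of this sketch; in a full XPath engine, //div/div would also match the nested div inside the first block.

```python
import xml.etree.ElementTree as ET

HTML = """
<div>
  <div>
    <h3><a href='http://Ali.org'></a></h3>
    <div><p><a href='http://Mohammad.org'></a></p></div>
  </div>
  <div>
    <h4><a href='http://Ali.org'></a></h4>
    <p><a href='http://Mohammad.org'></a></p>
  </div>
</div>
"""

outer = ET.fromstring(HTML)

# a[1] after // is tested against each parent: every <a> that is the first <a>
# child of its own parent matches, so all four links come back.
first_child_links = [a.get("href") for a in outer.findall("./div//a[1]")]

# Emulating descendant::a[1]: take the first <a> in document order under each
# direct child div of the outer div.
first_descendant_links = []
for inner in outer.findall("./div"):
    first = next(inner.iter("a"), None)
    if first is not None:
        first_descendant_links.append(first.get("href"))

print(first_child_links)
print(first_descendant_links)
```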

XPATH Firebug filter does not filter as expected

Basically I have a list of presidents and I am only interested in the Nixon link and not Clinton or Obama.
What I find is that filtering as I have done returns the correct number of presidents (i.e. 1 in this case) but returns ALL of the a links instead of just the one for Nixon.
HTML:
<div class="headlineBlock">
<h2>Obama</h2>
<p class="tudor"><strong>Conditions:</strong> Always sunny </p>
<table class="resultGrid"><tr> <td class="first">
<h4><a href="http://www.thelinkiwant.com?params" title="Click to view result"</a></h4>
<div class="headlineBlock">
<h2>Nixon</h2>
<p class="nixon"><strong>Conditions:</strong> Sometimes late </p>
<table class="resultGrid"><tr> <td class="first">
<h4><a href="http://www.thelinkiwant.com/?params" title="Click to view result"</a></h4>
<div class="headlineBlock">
<h2>Clinton</h2>
<p class="tudor"><strong>Conditions:</strong> Never rainy </p>
<table class="resultGrid"><tr> <td class="first">
<h4><a href="http://www.thelinkiwant/?params" title="Click to view result"</a></h4>
XPATH:
$x("//div[@class='headlineBlock']/h2[not(contains('|Clinton|Obama|',concat('|',.,'|') ))]//../../table/a/@href")
There are several issues with your example.
There are closing brackets missing after every single "Click to view result", your "headlineBlock" divs and tables aren't closed, etc. So first you should make sure that your data is well formed.
W3C's XML validator can help with that.
Your XPath looks mostly OK; I think the issue is with the // at the end, which comes a bit too early. Try this instead:
//div[@class='headlineBlock']/h2[not(contains('|Clinton|Obama|',concat('|',.,'|') ))]/..//a/@href
//div[@class='headlineBlock']
All divs of class headlineBlock ...
/h2[not(contains('|Clinton|Obama|',concat('|',.,'|') ))]
... their h2 children whose text is not one of the excluded names (Clinton, Obama).
/..
Up one level (now we are at div headlineBlock again)
//a
Any a element below that div, at any depth
/@href
The href attribute
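The same filter can be sketched in Python with the standard-library xml.etree.ElementTree. Its XPath subset has no contains() or concat(), so in this sketch the name check is done in Python instead of inside the expression. The markup is an assumed well-formed version of the question's HTML (tags closed, p elements left out, wrapped in a <root> element), and EXCLUDED is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the question's markup.
HTML = """
<root>
  <div class="headlineBlock">
    <h2>Obama</h2>
    <table class="resultGrid"><tr><td class="first">
      <h4><a href="http://www.thelinkiwant.com?params" title="Click to view result"></a></h4>
    </td></tr></table>
  </div>
  <div class="headlineBlock">
    <h2>Nixon</h2>
    <table class="resultGrid"><tr><td class="first">
      <h4><a href="http://www.thelinkiwant.com/?params" title="Click to view result"></a></h4>
    </td></tr></table>
  </div>
  <div class="headlineBlock">
    <h2>Clinton</h2>
    <table class="resultGrid"><tr><td class="first">
      <h4><a href="http://www.thelinkiwant/?params" title="Click to view result"></a></h4>
    </td></tr></table>
  </div>
</root>
"""

EXCLUDED = {"Clinton", "Obama"}

root = ET.fromstring(HTML)
links = []
for block in root.findall(".//div[@class='headlineBlock']"):
    name = block.findtext("h2", default="").strip()
    if name in EXCLUDED:
        continue
    # .//a searches the whole block, however deeply the link is nested.
    links.extend(a.get("href") for a in block.findall(".//a"))

print(links)
```

Searching from each headlineBlock div, rather than from the h2, is what keeps the result limited to the links that belong to the presidents that pass the filter.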

Simple dom document iteration

I have HTML like so:
<html>
<body>
<div class="somethingunneccessary"></div>
<div class="container">
<div>
<p>text1</p>
<p>text2</p>
<p>text3</p>
</div>
<div>
<p>text4</p>
<p>text5</p>
<p>text6</p>
</div>
<div>
<p>text7</p>
<p>text8</p>
<p>text9</p>
</div>
<div>
<p>text10</p>
<p>text11</p>
<p>text12</p>
</div>
<div>
<p>text13</p>
<p>text14</p>
<p>text15</p>
</div>
</div>
</body>
</html>
What I'm trying to accomplish is the following:
1. Loop over the div elements within the div that has the class container.
2. During the iteration, grab the text from the 3rd p tag.
The looping part is essential; I don't want to just slice out the p tags by themselves.
I've got some code done, but it doesn't do any looping:
$doc=new DOMDocument();
$doc->loadHTML($htmlsource);
$xpath = new DOMXpath($doc);
$commentxpath = $xpath->query("/html/body/div[2]/div[5]/p[3]");
$commentdata = $commentxpath->item(0)->nodeValue;
How do I loop through each inner div element and extract the 3rd p tag?
Like I said, the looping is essential.
You said that during the iteration you want to grab the text from the 3rd p tag.
Try:
"//div[@class='container']/div/p[3]"
This should return the third p of every div inside the div with class container.
You may have to query by attribute (see "php xpath get attribute value"):
$xpath->query("/html/body/div[@class='container']");
Just try
/html/body/div/div//p
That should return only the p elements.
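Since the question insists on looping, here is a minimal sketch of the loop itself: select the inner divs once, then run a relative query for the third p inside each one. It is written in Python with the standard-library xml.etree.ElementTree rather than PHP, and the markup is shortened to two inner divs; both choices are assumptions made for the example. In PHP the same shape would be a foreach over the node list with a relative query per node, passing the node as the context argument of DOMXPath::query.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the page (two inner divs instead of five).
HTML = """
<html><body>
  <div class="somethingunneccessary"></div>
  <div class="container">
    <div><p>text1</p><p>text2</p><p>text3</p></div>
    <div><p>text4</p><p>text5</p><p>text6</p></div>
  </div>
</body></html>
"""

root = ET.fromstring(HTML)

third_paragraphs = []
# Outer query: every div directly inside the container div.
for inner in root.findall(".//div[@class='container']/div"):
    # Inner query, relative to the current div: its third p child.
    third = inner.find("p[3]")
    if third is not None:
        third_paragraphs.append(third.text)

print(third_paragraphs)
```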

Phantom <span> element using ImportXML with XPath in Google Spreadsheet

I am trying to get the value of an element attribute from this site via importXML in Google Spreadsheet using XPath.
The attribute value I seek is the content attribute of the <span> with itemprop="price".
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
...
</div>
I can access <div class="left"> but I can't get to the <span> element.
I tried using:
//span[@class='pret']/@content I get #N/A;
//span[@itemprop='price']/@content I get #N/A;
//div[@class='left']/span[@class='pret' and @itemprop='price']/@content I get #N/A;
//div[@class='left']/span[1]/@content I get #N/A;
//div[@class='left']/span/text() to get the text node of <span> I get #N/A;
//div[@class='left']//span/text() I get the text node of a <span> lower in div.left.
To get the text node of <span> I have to use //div[@class='left']/text(). But I can't use that text node because the layout of the span changes if a product is on sale, so I need the attribute.
It's as if the span I'm looking for does not exist, although it appears in Chrome's developer view and in the page source, and all the XPath expressions work in the console using $x("").
I tried to generate the XPath directly from the developer tools by right-clicking, and I get //*[@id='produs']/div[4]/div[4]/div[1]/span, which does not work. I also tried to generate the XPath with Firefox and with plugins for Firefox and Chrome, to no avail. The XPath generated in these ways did not even work on sites I managed to scrape with "hand coded XPath".
Now, the strangest thing is that on this other site with apparently similar code structure the XPath //span[@itemprop='price']/@content works.
I have struggled with this for 4 days now. I'm starting to think it has something to do with the auto-closing meta tag, but why doesn't this happen on the other site?
Perhaps the following formulas can help you:
=ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']/text()")
Or
=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)
UPDATE
It seems that ImportXML does not properly parse the entire document, so it fails. An extract of the document, something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
<div class="left" style="margin-top: 10px;">
<meta itemprop="currency" content="RON">
<span class="pret" itemprop="price" content="698,31 RON">
<p class="pret">Pretul tau:</p>
698,31 RON
</span>
<div class="resealed-info">
» Vezi 1 resigilat din aceasta categorie
</div>
<ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
<li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin BRD</li>
<li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
<li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
</ul>
</div>
<div class="right" style="height: 103px;line-height: 103px;">
<form action="/?a=shopping&sa=addtocart" method="post" id="add_to_cart_form">
<input type="hidden" name="product-183641" value="on"/>
<img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/>
</form>
</div>
</div>
</html>
works with the following XPath query:
"//div[@class='product-info-price']//div[@class='left']//span[@itemprop='price']/@content"
UPDATE
It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:
/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var content = '';
  var response = UrlFetchApp.fetch(url);
  if (response) {
    var html = response.getContentText();
    // Capture the content attribute of the price span; [^"]* stops at the closing quote.
    var match = html && html.match(/<span class="pret" itemprop="price" content="([^"]*)"/i);
    // Leave content empty instead of throwing when the span is not found.
    if (match) content = match[1];
  }
  return content;
}
Then you can use as follows:
=MyImportXML("http://...")
At this time, the web page referred to in the first link no longer includes a span tag with itemprop="price", but the following XPath returns 639:
//b[@itemprop='price']
It looks to me like the problem was that the meta tag was not XHTML compliant, but now all the meta tags are properly closed.
Before:
<meta itemprop="currency" content="RON">
Now
<meta itemprop="priceCurrency" content="RON" />
For web pages that are not XHTML compliant, use another solution instead of IMPORTXML, such as IMPORTDATA with REGEXEXTRACT, or Google Apps Script with the UrlFetch Service and the JavaScript match function, among other alternatives.
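To illustrate why an unclosed meta tag can matter, here is a small Python sketch that feeds both variants to a strict XML parser (the standard-library xml.etree.ElementTree). This is only a stand-in: how ImportXML parses pages internally is not stated in this thread, so treat the comparison as an assumption. The helper price_content is a hypothetical name, and the markup is trimmed from the question's sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup, with the meta tag left unclosed.
UNCLOSED = """
<div class="left">
  <meta itemprop="currency" content="RON">
  <span class="pret" itemprop="price" content="698,31 RON">698,31 RON</span>
</div>
"""

# Same markup with the meta tag closed the XHTML way.
CLOSED = UNCLOSED.replace('content="RON">', 'content="RON" />')


def price_content(markup):
    """Return the price span's content attribute, or None if the markup is not well-formed XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    span = root.find(".//span[@itemprop='price']")
    return None if span is None else span.get("content")


print(price_content(UNCLOSED))
print(price_content(CLOSED))
```

With the unclosed meta tag the strict parser gives up at the mismatched </div>, so the span is never reached; once the tag is closed, the attribute lookup succeeds.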
Try something like this:
# tree is assumed to be a document already parsed with lxml
print('content by key', tree.xpath('//*[@itemprop="price"]')[0].get('content'))
or
nodes = tree.xpath('//div/meta/span')
for node in nodes:
    print('content =', node.get('content'))
But I haven't tried that.

Resources