I am trying to get the error message off a page from a site. The list contains several possible errors, so I can't check by id, but I do know that the one with display:list-item is the one I want. This is my rule, but it doesn't seem to work. What is wrong with it? What I want returned is the error text in the element.
//*[@id='errors']/ul/li[contains(@style,'display:list-item')]
Example dom elements:
<div id="errors" class="some class" style="display: block;">
<div class="some other class"></div>
<div class="some other class 2">
<span class="displayError">Please correct the errors listed in red below:</span>
<ul>
<li style="display:none;" id="invalidId">Enter a valid id</li>
<li style="display:list-item;" id="genericError">Something bad happened</li>
<li style="display:none;" id="somethingBlah" ............ </li>
....
</ul>
</div>
The correct XPath should be:
//*[@id='errors']//ul/li[contains(@style,'display:list-item')]
After //*[@id='errors'] you need an extra /, because <ul> is not a direct child of that element. Using // again searches all of its descendants for <ul>.
If you can avoid //, spelling out the full path is usually faster, because the engine does not have to scan every descendant.
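The difference between the child step and the descendant step can be checked outside the browser. The following is a minimal sketch using Python's standard-library ElementTree on a trimmed, well-formed copy of the sample markup; the PAGE string and the variable names are made up for illustration. ElementTree implements only a subset of XPath and has no contains(), so the style test is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample markup (assumption: the elided
# parts of the real page do not matter for this check).
PAGE = """
<div id="errors" class="some class" style="display: block;">
  <div class="some other class"></div>
  <div class="some other class 2">
    <span class="displayError">Please correct the errors listed in red below:</span>
    <ul>
      <li style="display:none;" id="invalidId">Enter a valid id</li>
      <li style="display:list-item;" id="genericError">Something bad happened</li>
    </ul>
  </div>
</div>
"""

# Wrap in a body element so the div with id="errors" is a descendant of the root.
root = ET.fromstring("<body>" + PAGE + "</body>")

# Child step: <ul> is not a direct child of the element with id="errors".
direct = root.findall(".//*[@id='errors']/ul/li")

# Descendant step: <ul> is found anywhere below that element.
nested = root.findall(".//*[@id='errors']//ul/li")

# Python stand-in for contains(@style, 'display:list-item').
visible = [li.text for li in nested if "display:list-item" in li.get("style", "")]

print(len(direct), len(nested), visible)
```

The child-only path finds nothing, while the descendant path finds both list items and the filter keeps the visible one. If the real page writes the style with a space after the colon, the substring test would need to allow for that.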
Having the following HTML (a snippet grabbed from the web page I want to scrape):
<div class="ulListContainer">
<section class="stockUpdater">
<ul class="column4">
<li>
<img src="1.png" alt="">
<strong>
Buy*
</strong>
<strong>
Sell*
</strong>
</li>
<li>
<header>
$USD
</header>
<span class="">
20.90
</span>
<span class="">
23.15
</span>
</li>
</ul>
<ul>...</ul>
</section>
</div>
How do I get the value of the first span in the second li using XPath? The result should be 20.90.
I have tried //div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1], but I am not getting any values. I must say this is being used from a Google Sheet with the IMPORTXML function (I am not sure which version of XPath it uses). Can I get some help?
Update
Apparently Google Sheets does not support such a "complex" XPath expression, since the expression itself seems to work fine:
Update 1
As requested I've shared the Google Sheet I am using to test this, here is the link
What you need is:
=IMPORTXML(A1;"//li[contains(text(),'USD')]/span[1]")
Removing section from your original XPath will work too:
=IMPORTXML(A1;"//div[@class='ulListContainer']/ul[1]/li[2]/span[1]")
Try this:
=IMPORTXML("URL","//span[1]")
Change URL to the actual website link/URL
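One way to check the positional path from the question against the posted snippet, independent of Google Sheets, is sketched below with Python's standard-library ElementTree. The SNIPPET string is a tidied copy of the markup in the question and the variable names are illustrative; it says nothing about the HTML that IMPORTXML actually receives from the live page.

```python
import xml.etree.ElementTree as ET

# Tidied copy of the snippet from the question (the <img> is self-closed and
# the second, elided <ul> is left empty so that the markup parses as XML).
SNIPPET = """
<div class="ulListContainer">
  <section class="stockUpdater">
    <ul class="column4">
      <li>
        <img src="1.png" alt=""/>
        <strong>Buy*</strong>
        <strong>Sell*</strong>
      </li>
      <li>
        <header>$USD</header>
        <span class="">20.90</span>
        <span class="">23.15</span>
      </li>
    </ul>
    <ul></ul>
  </section>
</div>
"""

root = ET.fromstring("<body>" + SNIPPET + "</body>")

# Same positional steps as in the question: first ul, second li, first span.
span = root.find(".//div[@class='ulListContainer']/section/ul[1]/li[2]/span[1]")
buy_price = span.text.strip()
print(buy_price)
```

On this snippet the path resolves to the first span of the second list item, so a failure inside the spreadsheet would point at the document being fetched rather than at the steps themselves.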
I have noticed that XPath axes sometimes seem to return the wrong nodes. I have two examples:
url: "http://demo.guru99.com/v1/"
<tr>
<td align="center">
<img src="../images/1.gif">
<img src="../images/3.gif">
<img src="../images/2.gif">
</td>
</tr>
I can select the three img elements with the child axis: "//td//child::img". However, when I use "//td//following-sibling::img", it still returns the second and third img elements. As far as I know, child and sibling are two different things, so why does this happen?
url: http://demo.guru99.com/selenium/guru99home/
<div class="rt-grid-12 rt-alpha rt-omega" id="rt-feature">
<div class="rt-grid-6 ">
<div class="rt-block">
<h3>
Desktop, mobile, and tablet access</h3>
<ul>
<li>
<p>
Free android App</p>
</li>
<li>
<p>
Download any tutorial for free</p>
</li>
<li>
<p>
Watch video tutorials from anywhere </p>
</li>
</ul>
<p>
<img alt="" src="images/app_google_play(1).png"></p>
</div>
</div>
<div class="rt-grid-5 ">
<div class="rt-block">
<img src="images/logo_respnsivsite.png"><br>
</div>
</div>
</div>
Here, if I use "//div[@id='rt-feature' and (@class='rt-grid-12 rt-alpha rt-omega')]//following-sibling::div", the div elements that should be children are still counted as siblings.
If I use "//div[@id='rt-feature' and (@class='rt-grid-12 rt-alpha rt-omega')]//parent::div", the element itself and its child div elements are all counted as parents.
This causes me a lot of confusion. Please help me.
Suggesting that the XPath parser returns the wrong nodes, rather than that you don't understand why it is returning what it does, is starting from the wrong mindset. Unless you know the XPath parser is unreliable, start with the assumption that it is right and your expectations are wrong. Then go to the spec and study the semantics of the expression you have written.
You will find that
//td//following-sibling::img
is an abbreviation for
/descendant-or-self::node()/td/descendant-or-self::node()/following-sibling::img
so you have asked for all the following siblings of all the descendants of all the td nodes, which is exactly what you are getting.
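To make the expansion concrete, here is a small sketch that walks the same steps by hand with Python's standard-library ElementTree. The helper names are invented for this example, and the row is written without whitespace so that only element nodes are involved (ElementTree does not expose text nodes as siblings, whereas a full XPath engine would also visit them).

```python
import xml.etree.ElementTree as ET

# The row from the question, with whitespace removed so that only element
# nodes are involved.
ROW = (
    '<tr><td align="center">'
    '<img src="../images/1.gif"/>'
    '<img src="../images/3.gif"/>'
    '<img src="../images/2.gif"/>'
    '</td></tr>'
)


def following_siblings(elem, parent_of):
    """Return the element siblings that come after elem under the same parent."""
    parent = parent_of.get(elem)
    if parent is None:
        return []
    children = list(parent)
    return children[children.index(elem) + 1:]


def following_sibling_imgs_below_td(root):
    """Hand-written walk of td/descendant-or-self::node()/following-sibling::img."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = []
    for td in root.iter("td"):
        for node in td.iter():  # the td itself, then each descendant element
            for sibling in following_siblings(node, parent_of):
                if sibling.tag == "img" and sibling not in found:
                    found.append(sibling)
    return found


root = ET.fromstring(ROW)
print([img.get("src") for img in following_sibling_imgs_below_td(root)])
```

The first img contributes the second and third as its following siblings, the second contributes the third, and the td itself has no img siblings, which is why the first img never appears in the result.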
I've come across people who habitually write "//" in place of "/" as a sort of magic fairy dust without having the faintest idea what it means. Don't do it: read the spec.
I want to select all the LI elements that contain a SPAN with id="liveDeal152_dealPrice" as a descendant. How do I do this with XPath?
Here is a sample html
<ul>
<li id="liveDeal_152">
<p class="price">
<em>|
<span class="WebRupee">₹ </span>
<span id="liveDeal152_dealPrice">495 </span>
</p>
</li>
<li id="liveDeal_152">
<p class="price">
<em>|
<span class="WebRupee">₹ </span>
(price hidden)
</p>
</li>
</ul>
//li[.//span[@id = 'liveDeal152_dealPrice']] should do. Or, more verbose but closer to your textual description, //li[descendant::span[@id = 'liveDeal152_dealPrice']].
Use this
//li[.//span[@id="liveDeal152_dealPrice"]]
It selects
ALL <li> ELEMENTS
//li[ ]
THAT HAVE A <span> DESCENDANT
.//span[ ]
WITH id ATTRIBUTE EQUAL TO "liveDeal152_dealPrice"
@id="liveDeal152_dealPrice"
That said, it doesn't seem like a very wise element selection, mostly because the id looks dynamically generated. If you're going to use it once, it's probably OK, but if you're using it, say, for testing and will reuse it many times, it might cause trouble. Are you sure this won't change when you change your website and/or database?
As a side note:
ul stands for "unordered list"
ol stands for "ordered list"
li stands for "list item"
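The "li that has a matching span somewhere below it" selection described above can be tried out with a short sketch. It uses Python's standard-library ElementTree, which does not accept a nested path inside a predicate, so the predicate is written as a Python filter instead; SAMPLE is a tidied copy of the sample list and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Tidied copy of the sample list (the <em> elements are closed here so the
# markup parses; the original snippet leaves them open).
SAMPLE = """
<ul>
  <li id="liveDeal_152">
    <p class="price">
      <em>|</em>
      <span class="WebRupee">₹ </span>
      <span id="liveDeal152_dealPrice">495 </span>
    </p>
  </li>
  <li id="liveDeal_152">
    <p class="price">
      <em>|</em>
      <span class="WebRupee">₹ </span>
      (price hidden)
    </p>
  </li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)

# Inner path of the predicate: a span with the wanted id anywhere below.
INNER = ".//span[@id='liveDeal152_dealPrice']"

# Stand-in for //li[.//span[@id='liveDeal152_dealPrice']]: keep each li
# for which the inner path finds at least one span.
matching = [li for li in root.iter("li") if li.find(INNER) is not None]

prices = [li.find(INNER).text.strip() for li in matching]
print(len(matching), prices)
```

Only the first list item is kept, because the second one has no span with that id below it.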
I am using Scrapy and have run into a few places where it would be nice to use variables, but I can't figure out how. That is, if I have some long string, it would be nice to store it in a variable long_string and then select for it: hxs.select('//div[@id=long_string]').
I'm sure this is supported by Scrapy and I just can't figure it out as it wouldn't make sense for you to always have to hard-code the string in.
Update:
So for the sample text below I want to extract the div where id="footer":
<div id="footer">
<div id="footer-menu">
<div class="region-footer-menu">
<div id="block-menu-menu-footer-menu" class="block-menu">
<div class="content">
<ul class="menu">
<li class="first leaf">FAQs</li>
<li class="leaf">Media</li>
<li class="leaf">Partners</li>
<li class="last leaf active-trail">Jobs</li>
</ul>
</div>
</div>
<div id="block-block-52" class="block block-block">
<div class="content">
<p>SUPPORT</p>
</div>
</div>
</div>
</div>
</div>
We initialize hxs = HtmlXPathSelector(response) for all the below segments.
The following code selects only the first div:
hxs.select('//div[@id=concat("foot","er")]')
This code selects nothing but gives no error:
hxs.select('//div[@id="foot"+"er"]')
Both of the below code segments select nothing and give no errors:
long_string = "foot"
hxs.select('//div[@id=concat(long_string,"er")]')
hxs.select('//div[@id=long_string]')
I would like to be able to do either of the bottom two methods and return the desired results.
Assuming + works for string concatenation in Scrapy, this should work:
hxs.select('//div[@id="' + long_string + '"]')
I'm not familiar with Scrapy, but I don't think you'll be able to select a div that doesn't exist.
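The reason the original attempts select nothing is that long_string inside the quoted expression is just text; XPath reads it as the name of a child element, not as a Python variable. A minimal sketch of splicing the value in on the Python side follows. It uses the standard-library ElementTree rather than Scrapy so that it runs on its own, and PAGE is a trimmed copy of the sample footer markup.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample footer markup, wrapped in a body element.
PAGE = """
<body>
  <div id="footer">
    <div id="footer-menu">
      <ul class="menu">
        <li class="first leaf">FAQs</li>
        <li class="last leaf active-trail">Jobs</li>
      </ul>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(PAGE)

long_string = "footer"

# The variable's value is spliced into the expression by Python before
# the XPath engine ever sees it.
by_concat = './/div[@id="' + long_string + '"]'
by_fstring = f'.//div[@id="{long_string}"]'

hits = root.findall(by_concat)
print(by_concat, [div.get("id") for div in hits])
```

Both ways of building the string give the same expression. Splicing breaks if the value itself contains a double quote. Recent Scrapy versions also document XPath variables, as in response.xpath('//div[@id=$val]', val=long_string), which sidesteps quoting; whether that is available depends on the Scrapy version in use.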
Have you tried this?
hxs.select('//div[@id="' + long_string_variable + '"]')
Today I stumbled upon a very interesting case (at least for me). I am messing around with Selenium and XPath and tried to get some elements, but got strange behaviour:
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some other text
</a>
</div>
</div>
</div>
<div class="resultcontainer">
<div class="info">
<div class="title">
<a>
some even unrelated text
</a>
</div>
</div>
</div>
This is my data.
When I run the following XPath query:
//div[@class="title"][1]/a
I get ALL of the <a> elements as a result instead of only the first one. But if I query:
//div[@class="resultcontainer"][1]/div[@class="info"]/div[@class="title"]/a
I get only the first <a>, not all.
Is there some divine reason behind that?
Best regards,
bisko
I think you want
(//div[@class="title"])[1]/a
This:
//div[@class="title"][1]/a
selects all (<a> elements that are children of) <div> elements that have a @class of 'title', that are the first children of their parents (in this context). Which means: it selects all of them.
The working XPath selects all <div> elements that have a @class of 'title' - and of those it takes the first one.
The predicates (the expressions in square brackets []) are applied to each element that matched the preceding location step (i.e. "//div") individually. To apply a predicate to a filtered set of nodes, you need to make the grouping clear with parentheses.
Consequently, this:
//div[1][@class="title"]/a
would select all <div> elements that are the first <div> child of their parent, and then filter them down further by checking the @class value. Also not what you want. ;-)
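The per-step behaviour of predicates can be seen on the sample data with a short sketch using Python's standard-library ElementTree. PAGE wraps the three result containers in a body element, which is an assumption made so the markup has a single root. ElementTree has no parenthesised grouping, so taking the first item of the whole result is done by indexing the Python list, and its handling of [1] is simpler than full XPath's, though on this markup the two agree.

```python
import xml.etree.ElementTree as ET

# The three result containers from the question, wrapped in a body element.
PAGE = """
<body>
  <div class="resultcontainer"><div class="info"><div class="title"><a>some text</a></div></div></div>
  <div class="resultcontainer"><div class="info"><div class="title"><a>some other text</a></div></div></div>
  <div class="resultcontainer"><div class="info"><div class="title"><a>some even unrelated text</a></div></div></div>
</body>
""".strip()

root = ET.fromstring(PAGE)

# [1] is applied per parent: every title div is the first div of its own parent.
per_parent = root.findall(".//div[@class='title'][1]/a")

# Stand-in for (//div[@class="title"])[1]/a: index the whole result, then step to a.
first_title = root.findall(".//div[@class='title']")[0]
grouped = first_title.findall("a")

# The longer query from the question: [1] narrows the resultcontainer step.
scoped = root.findall(
    ".//div[@class='resultcontainer'][1]/div[@class='info']/div[@class='title']/a"
)

print([a.text for a in per_parent])
print([a.text for a in grouped])
print([a.text for a in scoped])
```

The first query returns all three links, while the other two return only the first, since in those the position test applies either to the complete result or to a step where the three containers share one parent.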