Getting two lists using XPath that are both contained in the same container

In the code sample below I want to use XPath inside Scrapy to extract the items of List 1 and then the items of List 2. Some items may be links while others are plain text. What I need is two strings (or lists): one for List 1 and one for List 2.
<div class="row">
<div class="col-xs-12 no-padding-xs">
<h3 class="text-primary gutter-xs">List 1</h3>
<div class="well well-sm">
Miniature, Mustang, Paint Pony, Pinto, Pony, POA, Quarter Pony, Shetland Pony, Spanish Mustang
</div>
<h3 class="text-primary gutter-xs">List 2</h3>
<div class="well well-sm">
All Around, Driving, Halter, Lesson, Natural Horsemanship, Show, Trail Riding, Western Pleasure, Western Riding, Youth, Champion Trainer, POA Ponies for Sale, Newaygo County, Horse Boarding, Equestrian Coaching, Michigan, Riding Lessons, Horse Lea
</div>
</div>
</div>

Not sure that I understood you properly, but you can try:
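from w3lib.html import remove_tags

for list_text in ['List 1', 'List 2']:
    # First <div> that follows the <h3> whose text matches the heading
    div_data = response.xpath('//h3[text()="{}"]/following-sibling::div[1]'.format(list_text)).get()
    if not div_data:
        continue
    # Split on commas, then strip markup and whitespace from each item
    print([remove_tags(i).strip() for i in div_data.split(',')])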
Or if you want just strings:
for list_text in ['List 1', 'List 2']:
    div_data = response.xpath('//h3[text()="{}"]/following-sibling::div[1]'.format(list_text)).get()
    if not div_data:
        continue
    # Strip the markup and print the whole list as one string
    print(remove_tags(div_data))
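For trying the same idea outside a Scrapy spider, here is a minimal sketch that uses lxml directly instead of `response.xpath`. The shortened sample markup, the `<a>` links in it, and the helper name `extract_lists` are assumptions made for illustration; `text_content()` is used so that linked and unlinked items are treated alike.

```python
from lxml import html

# Shortened version of the markup in the question; the <a> links are an
# assumption standing in for the "linked out" items it mentions.
SAMPLE = """
<div class="row">
  <div class="col-xs-12 no-padding-xs">
    <h3 class="text-primary gutter-xs">List 1</h3>
    <div class="well well-sm">
      Miniature, <a href="/mustang">Mustang</a>, Paint Pony
    </div>
    <h3 class="text-primary gutter-xs">List 2</h3>
    <div class="well well-sm">
      All Around, Driving, <a href="/halter">Halter</a>
    </div>
  </div>
</div>
"""


def extract_lists(markup, headings=("List 1", "List 2")):
    """Return {heading: [items]} for each heading found in the markup."""
    tree = html.fromstring(markup)
    result = {}
    for heading in headings:
        # $h is an XPath variable, filled in by lxml from the keyword argument
        divs = tree.xpath("//h3[text()=$h]/following-sibling::div[1]", h=heading)
        if not divs:
            continue
        # text_content() flattens the div, so link text is kept, tags dropped
        text = divs[0].text_content()
        result[heading] = [item.strip() for item in text.split(",") if item.strip()]
    return result


lists = extract_lists(SAMPLE)
print(lists)
```

Splitting the flattened text rather than the raw HTML avoids breaking on commas that might appear inside attribute values.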

Related

XPATH Matching Sale Price or Regular Price

I need to match either the sale price (if on sale) or the regular price using one expression(hope that's the right term). Here's the two example HTML structures:
On Sale
<span class="price">
<del>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
14.99
</span>
</del>
<ins>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
12.99
</span>
</ins>
Regular Price
<p class="price">
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
25.00
</span>
</p>
The expression I have so far is:
//*[@class="woocommerce-Price-amount amount"][last()]
It matches on both scenarios but returns both regular and sale prices for the "On Sale" scenario. Do I need some conditional to only return the sale price?
I thought I could possibly return only the last [@class="woocommerce-Price-amount amount"]. I tried last-child but didn't fully understand how it works.
First, note that in your "on sale" snippet the outer <span class="price"> is missing its closing </span>, but if we fix that, this convoluted expression should do the trick.
It's formatted for easier reading:
//span[@class="woocommerce-Price-amount amount"]
    [parent::p
     or
     parent::ins
    ]
Try it on your actual code and see if it works.
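As a rough check of that expression, the sketch below runs it against both snippets from the question using lxml. The "on sale" markup has the missing `</span>` added, and the helper name `current_price` is made up for this example.

```python
from lxml import html

# "On sale" snippet from the question, with the missing </span> added
ON_SALE = """
<span class="price">
  <del><span class="woocommerce-Price-amount amount"><span class="woocommerce-Price-currencySymbol">$</span>14.99</span></del>
  <ins><span class="woocommerce-Price-amount amount"><span class="woocommerce-Price-currencySymbol">$</span>12.99</span></ins>
</span>
"""

REGULAR = """
<p class="price">
  <span class="woocommerce-Price-amount amount"><span class="woocommerce-Price-currencySymbol">$</span>25.00</span>
</p>
"""

# Keep an amount only if it sits directly under <p> (regular price)
# or under <ins> (sale price); the struck-through <del> amount is skipped.
PRICE_XPATH = (
    '//span[@class="woocommerce-Price-amount amount"]'
    "[parent::p or parent::ins]"
)


def current_price(markup):
    """Return the text of every amount matched by PRICE_XPATH."""
    tree = html.fromstring(markup)
    return [node.text_content().strip() for node in tree.xpath(PRICE_XPATH)]


print(current_price(ON_SALE))
print(current_price(REGULAR))
```

The predicate works here because the old price is the only amount whose parent is `<del>`; a theme that wraps prices differently would need a different parent test.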

ImportXML on Google Sheets - how to get a user variable (kind of)?

After managing to import the filmography for any actor on rateyourmusic.com via
=importxml("https://rateyourmusic.com/films/cary_grant/","//li")
I couldn't figure out how to retrieve my own user rating for certain titles (which would also tell me which title in the list I've already seen).
As I'm still learning the ropes of the importxml command, all I've found out is that the ratings sit under the 'film_cat_catalog_msg_1050' identifier (an element id, I think). Fiddling with the command, all I could get into a separate column of my spreadsheet so far was the standard word 'rate', not my personal rating.
Could anyone help me with that, please?
<li><span onclick="RYMartistPage.openFilmCataloger(1050);" class="disco_cat_inner"><span class="disco_cat_catalog_msg"><i class="fa fa-caret-left"></i> </span> <span id="film_cat_catalog_msg_1050">4.5</span></span><div id="film_cataloger_1050" class="film_cataloger"><div class="film_cataloger_close" onclick="RYMartistPage.collapseFilmCataloger(1050);"><i class="fa fa-caret-right"></i> </div> <div id="film_cataloger_content_1050" class="film_cataloger_content"></div></div>
<div class="has_tip film_rel_img delayed_discography_img" data-delayloadurl="url('//e.snmc.io/lk/m/l/45956edc922ce07e2b84a6ff23da3452/6152891.jpg')" data-delayloadurl2x="url('//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg')" style="background: rgba(0, 0, 0, 0) url("//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg") repeat scroll 0% 0%;"></div><div class="disco_avg_rating">3.81</div><div class="disco_ratings">1,063</div><div class="disco_reviews">25</div> <div class="film_info">
<div class="film_mainline recommended">
<a title="[Film1050]" href="/film/his_girl_friday/" class="film">His Girl Friday</a>
</div>
<div class="film_subline">
<span title="18 January 1940 " class="disco_year_ymd">1940</span> • Walter Burns
</div>
</div></li>
As you have to be logged in to see these ratings, here's a screenshot for those who aren't members:
[Screenshot: rateyourmusic.com filmography]
Try it with this XPath query:
//span[@id="film_cat_catalog_msg_1050"]
As you have already guessed, we need something like starts-with, since the numeric part is actually variable:
//span[starts-with(@id, "film_cat_catalog_msg_")]
And putting it all together:
=importxml("https://rateyourmusic.com/films/cary_grant/","//span[starts-with(@id, 'film_cat_catalog_msg_')]")
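To see what the starts-with predicate selects without going through a spreadsheet, here is a small sketch in Python with lxml. The markup is a cut-down stand-in for the page: the second list item, its id 2077, and its title are invented for illustration.

```python
from lxml import html

# Cut-down stand-in for the filmography list. Only the first <li> is based
# on the question; the second one (id 2077) is a made-up unrated entry.
SAMPLE = """
<ul>
  <li><span id="film_cat_catalog_msg_1050">4.5</span>
      <a class="film" href="/film/his_girl_friday/">His Girl Friday</a></li>
  <li><span id="film_cat_catalog_msg_2077">rate</span>
      <a class="film" href="/film/some_other_film/">Some Other Film</a></li>
</ul>
"""

tree = html.fromstring(SAMPLE)

# Match every span whose id begins with the fixed prefix, whatever the number
ratings = tree.xpath('//span[starts-with(@id, "film_cat_catalog_msg_")]/text()')
print(ratings)
```

Each match comes back in document order, so the values line up with the titles returned by a matching query on the `film` links.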

How do I exclude elements from an XPath query?

I'm trying to select the ingredients in an ingredients list, but there are also tooltips scattered amongst them (on the BBC Good Food site).
As a stripped-down example:
<li class="ingredients-list__item" itemprop="ingredients">
400g
<a href="/glossary/new-potatoes" class="ingredients-list__glossary-link tooltip-processed">
new potato
<div id="gf-tooltip-0" class="gf-tooltip" role="tooltip">
<div class="gf-tooltip__content">
<div class="gf-tooltip__text">
<p>unwanted tooltip</p>
</div>
</div>
</div>
</a>, halved if large
<span class="ingredients-list__glossary-element" id="ingredients-glossary"></span>
</li>
I'm trying to end up with '400g new potato, halved if large', or equally good, ['400g', 'new potato', ', halved if large'].
Amongst other things I've tried:
s.xpath("//li[@class='ingredients-list__item'][not(div[@class='gf-tooltip'])]//text()").extract()
But this still returns the text in the tooltip div.
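Your predicate only rejects <li> elements that have a tooltip div as a direct child; it does not filter the text nodes that //text() then selects. One possible way is to exclude text nodes that have a tooltip div among their ancestors (broken into 2 lines for readability):
//li[@class='ingredients-list__item']
    //text()[not(ancestor::div[@class='gf-tooltip'])]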
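A quick way to confirm the behaviour is to run the expression over the stripped-down example with lxml, as in this sketch. The whitespace clean-up at the end is an addition of this example, not part of the XPath.

```python
from lxml import html

# The stripped-down ingredient markup from the question
SAMPLE = """
<li class="ingredients-list__item" itemprop="ingredients">
400g
<a href="/glossary/new-potatoes" class="ingredients-list__glossary-link tooltip-processed">
new potato
<div id="gf-tooltip-0" class="gf-tooltip" role="tooltip">
<div class="gf-tooltip__content">
<div class="gf-tooltip__text">
<p>unwanted tooltip</p>
</div>
</div>
</div>
</a>, halved if large
<span class="ingredients-list__glossary-element" id="ingredients-glossary"></span>
</li>
"""

tree = html.fromstring(SAMPLE)

# Every text node under the <li>, except those inside a tooltip div
text_nodes = tree.xpath(
    "//li[@class='ingredients-list__item']"
    "//text()[not(ancestor::div[@class='gf-tooltip'])]"
)

# Drop whitespace-only nodes and trim the rest
parts = [t.strip() for t in text_nodes if t.strip()]
print(parts)
```

Joining `parts` with spaces leaves a space before the comma, so the list form is probably the easier one to post-process.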

Select all nodes between two elements excluding unnecessary element from the intersection using XPath

There’s a document structured as follows:
<div class="document">
<div class="title">
<AAA/>
</div class="title">
<div class="lead">
<BBB/>
</div class="lead">
<div class="photo">
<CCC/>
</div class="photo">
<div class="text">
<!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="text">
<div class="more_text">
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="more_text">
<div class="other_stuff">
<DDD/>
</div class="other_stuff">
</div class="document">
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection, $ns1[count(.|$ns2) = count($ns2)], works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
    = count(//*[@class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff"> including the <div class="photo"> element. I tried several ways to insert not() selector in the formula itself
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]
(the same things with /preceding::* part) but they don't work. It looks like this not() method is ignored – the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
It’s not an option to select from <div class="photo"> element excluding it automatically because in other documents it can appear in any position or doesn't appear at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and back to the beginning of the whole document. Would it be better to specify an exact end point for the following:: and preceding:: axes? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn’t seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do it. For this particular case, the entire XPath would look like this (formatted for readability):
//*[@class="lead"]
    /following::*[
        count(. | //*[@class="other_stuff"]/preceding::*)
        =
        count(//*[@class="other_stuff"]/preceding::*)
    ][not(self::div[@class='photo'])]
    /text()
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure whether it would be 'better'. What I can tell you is that following::[@class="other_stuff"] is an invalid expression. You need to name the node test that the predicate applies to, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].
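The sketch below runs the intersection with the extra predicate on a small well-formed document, using lxml. The element names and text inside each section are invented for illustration. It also shows one thing to watch for: `self::div[...]` removes only the photo div itself, so elements nested inside it still match. If those should go too, `ancestor-or-self::` is one possible variant.

```python
from lxml import etree

# Simplified, well-formed stand-in for the document in the question;
# the inner elements and their text are assumptions for this example.
DOC = """
<div class="document">
  <div class="title"><h1>Title</h1></div>
  <div class="lead"><p>Lead</p></div>
  <div class="photo"><span>Caption</span></div>
  <div class="text"><p>First paragraph</p></div>
  <div class="more_text"><p>Second paragraph</p></div>
  <div class="other_stuff"><p>Footer</p></div>
</div>
""".strip()

# Kayessian intersection: elements after "lead" AND before "other_stuff"
BETWEEN = (
    '//*[@class="lead"]/following::*'
    '[count(. | //*[@class="other_stuff"]/preceding::*)'
    ' = count(//*[@class="other_stuff"]/preceding::*)]'
)

root = etree.fromstring(DOC)

# Predicate from the answer: drops the photo div, keeps what is inside it
self_only = root.xpath(BETWEEN + '[not(self::div[@class="photo"])]/text()')

# Variant: also drops every element nested inside the photo div
with_descendants = root.xpath(
    BETWEEN + '[not(ancestor-or-self::div[@class="photo"])]/text()'
)

print(self_only)         # the caption text is still present
print(with_descendants)  # the caption text is gone
```

In the question's sample the photo div holds only an empty element, so both forms would return the same text there.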

XPath: extract plain email text

I'm trying to extract the email text from a list but without success.
In particular I've used this code
//li/div/p//*[contains(., '@')]
but strangely it doesn't work! Could you help me?
Here's the code example:
<li class="bgmp_list-item">
<h3 class="bgmp_list-placemark-title">
Name1
</h3>
<div class="bgmp_list-description">
<p class="">
<strong class="">Responsible:</strong> John Doe <br>
<strong class="">Site:</strong> <a title="www.exemple.com" href="http://www.exemple.com" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.2ld.it']);" target="_blank" class="">www.2ld.it</a>
<br>
<strong class="">Email:</strong> some_email@email.com
<br><strong class="">Address:</strong> 3, Main Street 00000, London <br>
<strong>Tel:</strong> 00 000000 <strong>Fax:</strong> 0000000
</p>
</div>
You're almost there but not quite. For the sample code the correct XPath would be
//p/text()[contains(.,'@')]
Rather than reinvent the wheel, I'll point to a very good explanation of this in another answer.
By using p//*[contains(., '@')] you apply the predicate to the individual elements inside <p>, but no such element matches, because the target email address is a text node that is a direct child of <p>. This is one of the reasons why the initial XPath didn't work. Applying the predicate to <p> directly should work:
//li/div/p[contains(., '@')]
but that will return the <p> element. If you need to return only the text node that contains the email address, then the predicate should be applied to the individual text nodes within <p>, as mentioned in the other answer:
//li/div/p/text()[contains(., '@')]
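The difference between the two expressions can be checked with a short lxml sketch. The markup is a shortened copy of the sample, wrapped in a `<ul>` and closed properly, which is an assumption made so that it parses as a single fragment.

```python
from lxml import html

# Shortened copy of the sample; the <ul> wrapper and closing </li> are
# additions for this example.
SAMPLE = """
<ul>
<li class="bgmp_list-item">
  <h3 class="bgmp_list-placemark-title">Name1</h3>
  <div class="bgmp_list-description">
    <p class="">
      <strong class="">Responsible:</strong> John Doe <br>
      <strong class="">Email:</strong> some_email@email.com
      <br><strong class="">Address:</strong> 3, Main Street 00000, London <br>
    </p>
  </div>
</li>
</ul>
"""

tree = html.fromstring(SAMPLE)

# Original attempt: tests the elements inside <p>, none of which contain "@"
elements = tree.xpath('//li/div/p//*[contains(., "@")]')

# Fixed version: tests the text nodes that are direct children of <p>
emails = [t.strip() for t in tree.xpath('//li/div/p/text()[contains(., "@")]')]

print(elements)
print(emails)
```

The `strip()` call is needed because the matched text node includes the whitespace around the address.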

Resources