XPATH Matching Sale Price or Regular Price - xpath

I need to match either the sale price (if the item is on sale) or the regular price using a single expression (I hope that's the right term). Here are the two example HTML structures:
On Sale
<span class="price">
<del>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
14.99
</span>
</del>
<ins>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
12.99
</span>
</ins>
Regular Price
<p class="price">
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
25.00
</span>
</p>
The expression I have so far is:
//*[@class="woocommerce-Price-amount amount"][last()]
It matches on both scenarios but returns both regular and sale prices for the "On Sale" scenario. Do I need some conditional to only return the sale price?
I thought I could return only the last [@class="woocommerce-Price-amount amount"] element. I tried last-child, but I didn't fully understand how it works.

First, note that your "on sale" snippet is missing a closing </span> tag; once that is fixed, this somewhat convoluted expression should do the trick.
It's formatted for easier reading:
//span[@class="woocommerce-Price-amount amount"]
[parent::p
or
parent::ins
]
Try it on your actual code and see if it works.
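To check the idea outside a browser, here is a minimal sketch using only Python's standard library. One caveat: xml.etree.ElementTree supports just a subset of XPath (no parent:: axis, no "or"), so the sketch expresses the same selection as two child paths, .//ins/span[...] and .//p/span[...], rather than running the expression above verbatim. The function name current_price and the wrapping <div> are my own assumptions, and the "on sale" snippet has its missing </span> added.

```python
import xml.etree.ElementTree as ET

# Compact copies of the two structures from the question
# (the on-sale one with the missing </span> restored).
ON_SALE = (
    '<span class="price">'
    '<del><span class="woocommerce-Price-amount amount">'
    '<span class="woocommerce-Price-currencySymbol">$</span>14.99</span></del>'
    '<ins><span class="woocommerce-Price-amount amount">'
    '<span class="woocommerce-Price-currencySymbol">$</span>12.99</span></ins>'
    '</span>'
)
REGULAR = (
    '<p class="price">'
    '<span class="woocommerce-Price-amount amount">'
    '<span class="woocommerce-Price-currencySymbol">$</span>25.00</span>'
    '</p>'
)

AMOUNT = "span[@class='woocommerce-Price-amount amount']"


def current_price(fragment):
    """Return the price spans whose parent is <ins> or <p> (hypothetical helper)."""
    # Wrap the fragment so that a top-level <p> is also found by './/p'.
    root = ET.fromstring("<div>" + fragment + "</div>")
    hits = root.findall(".//ins/" + AMOUNT) + root.findall(".//p/" + AMOUNT)
    # Join all text inside each span and drop any whitespace.
    return ["".join("".join(node.itertext()).split()) for node in hits]


print(current_price(ON_SALE))   # ['$12.99']
print(current_price(REGULAR))   # ['$25.00']
```

With a full XPath 1.0 engine the single expression above should be usable directly; the split into two paths is only there to stay inside what ElementTree understands.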

Related

How to get specific xpath tag value

<div class="container">
<span class="price">
<bdi> 140 </bdi>
</span>
<span class="price">
<del>
<bdi>90</bdi>
</del>
<ins>
<bdi> 120 </bdi>
</ins>
</span>
</div>
I want to scrape a site whose HTML is formatted like the snippet above. I don't want the bdi value that sits under the del tag; I want the bdi values that sit directly under the span and under the ins tag. Is there an XPath that does this?
Doesn't the usual //span/ins/bdi/text() work for you?
It reads as "the text of a <bdi> whose parent is an <ins> whose parent is a <span>".
The CSS variant span>ins>bdi::text should also work, I suppose.
Sorry, I hadn't noticed that you need two values. In that case .xpath('//bdi[not(parent::del)]/text()').extract() will work well.
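For anyone who wants to see what that filter keeps without running Scrapy, here is a rough standard-library sketch. ElementTree has no parent:: axis and no not(), so it builds a child-to-parent map and applies the same "parent is not <del>" test in plain Python; the variable names are assumptions of mine.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="container">
<span class="price">
<bdi> 140 </bdi>
</span>
<span class="price">
<del>
<bdi>90</bdi>
</del>
<ins>
<bdi> 120 </bdi>
</ins>
</span>
</div>"""

root = ET.fromstring(HTML)

# ElementTree elements do not know their parent, so record it explicitly.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Keep every <bdi> whose parent is not <del>, mirroring //bdi[not(parent::del)].
prices = [
    bdi.text.strip()
    for bdi in root.iter("bdi")
    if parent_of[bdi].tag != "del"
]
print(prices)  # ['140', '120']
```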

Make XPath stop at a certain depth?

I have the following HTML
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>
I want an XPath that gets only the text 09:00 and not Some more text, without using text()[1], because that causes other problems. My current XPath looks like this:
("//span[1][contains(@class, 'day-time-clock')]/text()")
I want one that ignores this whole part of the HTML
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
You can limit the level of descendant:: nodes with position().
So the following expression does work:
span/descendant::node()[2 > position()]
Adjust the number in the predicate to your needs, 2 is only an example. A disadvantage of this approach is that the counting of the descendants is only accurate for the first child in the descending tree.
Another approach is to limit both the ancestors and the descendants:
span/descendant::node()[3 > count(ancestor::*) and 1 > count(descendant::*)]
Here, too, you have to adjust the numbers in the predicates to get any useful results.
Use normalize-space() as a predicate to keep only the text nodes that are not pure whitespace:
//span[contains(@class, 'day-time-clock')]/text()[normalize-space()]
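As a rough illustration of what text()[normalize-space()] selects here, this standard-library sketch collects the span's direct text nodes (its own leading text plus the tail after each child) and drops the whitespace-only ones. ElementTree cannot evaluate text() or normalize-space() itself, so this is an emulation in plain Python, not the expression being executed.

```python
import xml.etree.ElementTree as ET

HTML = """<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>"""

span = ET.fromstring(HTML)

# Direct text nodes of the span: the text before its first child,
# then whatever follows each child element (its "tail").
direct_text = [span.text] + [child.tail for child in span]

# Same effect as the [normalize-space()] predicate: drop empty/whitespace-only nodes.
times = [t.strip() for t in direct_text if t and t.strip()]
print(times)  # ['09:00']
```

"Some more text" never shows up because it belongs to the inner div, not to the span itself.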
I think (if I understand you correctly) that
"..//div[contains(@class, 'tooltip-box')]/parent::span"
gets you there.

Getting two lists using Xpath that are both contained in the same container

In the code sample below I'm looking to extract, using XPath inside of Scrapy, first from list 1 and then from list 2. Some items may be links while others are plain list items. What I need is two strings (or lists): one for List 1 and one for List 2.
<div class="row">
<div class="col-xs-12 no-padding-xs">
<h3 class="text-primary gutter-xs">List 1</h3>
<div class="well well-sm">
Miniature, Mustang, Paint Pony, Pinto, Pony, POA, Quarter Pony, Shetland Pony, Spanish Mustang
</div>
<h3 class="text-primary gutter-xs">List 2</h3>
<div class="well well-sm">
All Around, Driving, Halter, Lesson, Natural Horsemanship, Show, Trail Riding, Western Pleasure, Western Riding, Youth, Champion Trainer, POA Ponies for Sale, Newaygo County, Horse Boarding, Equestrian Coaching, Michigan, Riding Lessons, Horse Lea
</div>
</div>
</div>
Not sure that I understood you properly, but you can try:
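from w3lib.html import remove_tags

for list_text in ['List 1', 'List 2']:
    # Serialized HTML of the first <div> that follows the matching <h3>
    div_data = response.xpath('//h3[text()="{}"]/following-sibling::div[1]'.format(list_text)).get()
    if not div_data:
        continue
    # Split on commas, then strip tags and whitespace from each item
    print([remove_tags(i).strip() for i in div_data.split(',')])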
Or if you want just strings:
for list_text in ['List 1', 'List 2']:
    div_data = response.xpath('//h3[text()="{}"]/following-sibling::div[1]'.format(list_text)).get()
    if not div_data:
        continue
    # One string per list, with the surrounding tags removed
    print(remove_tags(div_data))

ImportXML on Google Sheets - how to get a user variable (kind of)?

After managing to import the filmography for any actor on rateyourmusic.com via
=importxml("https://rateyourmusic.com/films/cary_grant/","//li")
I couldn't figure out how to retrieve my own user rating for certain titles (which would also tell me which title in the list I've already seen).
As I'm still learning the ropes of the importxml command, all I found out is that the ratings sit under the 'film_cat_catalog_msg_1050' XPath identifier(?). Fiddling with the command, all I could get in a separate column of my spreadsheet so far was the standard 'rate' word, but no personal rating.
Could anyone help me with that, please?
<li><span onclick="RYMartistPage.openFilmCataloger(1050);" class="disco_cat_inner"><span class="disco_cat_catalog_msg"><i class="fa fa-caret-left"></i> </span> <span id="film_cat_catalog_msg_1050">4.5</span></span><div id="film_cataloger_1050" class="film_cataloger"><div class="film_cataloger_close" onclick="RYMartistPage.collapseFilmCataloger(1050);"><i class="fa fa-caret-right"></i> </div> <div id="film_cataloger_content_1050" class="film_cataloger_content"></div></div>
<div class="has_tip film_rel_img delayed_discography_img" data-delayloadurl="url('//e.snmc.io/lk/m/l/45956edc922ce07e2b84a6ff23da3452/6152891.jpg')" data-delayloadurl2x="url('//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg')" style="background: rgba(0, 0, 0, 0) url("//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg") repeat scroll 0% 0%;"></div><div class="disco_avg_rating">3.81</div><div class="disco_ratings">1,063</div><div class="disco_reviews">25</div> <div class="film_info">
<div class="film_mainline recommended">
<a title="[Film1050]" href="/film/his_girl_friday/" class="film">His Girl Friday</a>
</div>
<div class="film_subline">
<span title="18 January 1940 " class="disco_year_ymd">1940</span> • Walter Burns
</div>
</div></li>
As you have to be logged in in order to see said ratings, here's a screenshot for those who aren't members:
rateyourmusic.com filmography
Try it with this XPath query:
//span[@id="film_cat_catalog_msg_1050"]
Demo
As you have already guessed, we need something like starts-with, since the numeric part is actually variable:
//span[starts-with(@id, "film_cat_catalog_msg_")]
Demo 2
And putting it all together:
=importxml("https://rateyourmusic.com/films/cary_grant/","//span[starts-with(@id, 'film_cat_catalog_msg_')]")
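If it helps to see which elements the starts-with predicate picks up, here is a small Python sketch of the same prefix test. It is not something ImportXML runs; it only mimics the predicate with the standard library. The list below is a cut-down, made-up stand-in for the real page: only the 1050 entry comes from the question, the other two spans are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the filmography list. Only the first id/value
# is taken from the question; the other two entries are invented.
HTML = """<ul>
<li><span id="film_cat_catalog_msg_1050">4.5</span></li>
<li><span id="film_cat_catalog_msg_2077">rate</span></li>
<li><span id="something_else">n/a</span></li>
</ul>"""

PREFIX = "film_cat_catalog_msg_"
root = ET.fromstring(HTML)

# Same test as //span[starts-with(@id, 'film_cat_catalog_msg_')]
ratings = {
    span.get("id"): span.text
    for span in root.iter("span")
    if span.get("id", "").startswith(PREFIX)
}
print(ratings)
# {'film_cat_catalog_msg_1050': '4.5', 'film_cat_catalog_msg_2077': 'rate'}
```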

Wrap lines with tag using different logic in sublime text 2

I have hundreds of list items to code. Each list item consists of a title and a description on two lines, so I need to wrap every two lines with a tag. Is there any way to do this using Sublime Text 2? I am using Windows.
this is the output needed:
<ul>
<li>
this is the title
this is the description
</li>
<li>
this is the title
this is the description
</li>
</ul>
raw text looks like this:
this is title
this is description
this is title
this is description
=====
I have tried using Ctrl+Shift+G with ul>li*, but unfortunately it wraps each line with <li>.
If it is possible with Sublime Text, I actually need this type of structure:
<ul>
<li>
<span class="title">this is the title</span>
<span class="description">this is the description</span>
</li>
<li>
<span class="title">this is the title</span>
<span class="description">this is the description</span>
</li>
</ul>
How about a two step process using find and replace?
I am assuming that:
your original text is not indented at all;
your indentation is two spaces; and
you will handle the wrap with <ul> and resultant indentation yourself after this is done.
Original state:
this is title
this is description
this is title
this is description
Step one
With regular expression matching enabled, do a find and replace using these values.
FIND WHAT :: ((.*\n){1,2})
REPLACE WITH :: <li>\n\1</li>\n
Result:
<li>
this is title
this is description
</li>
<li>
this is title
this is description
</li>
Step two
With regular expression matching enabled, do a find and replace using these values.
FIND WHAT :: (<li>\n)(.*)\n(.*)
REPLACE WITH :: \1  <span class="title">\2</span>\n  <span class="description">\3</span>
Result:
<li>
  <span class="title">this is title</span>
  <span class="description">this is description</span>
</li>
<li>
  <span class="title">this is title</span>
  <span class="description">this is description</span>
</li>
What do you think?
Close enough to be useful?
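If you would rather script it than click through the dialog twice, the same two patterns can be tried with Python's re module. This is only a sketch under the assumptions listed above, plus one more: every line, including the last, ends with a newline (otherwise step one pairs the lines up wrongly). The variable names are mine.

```python
import re

raw = (
    "this is title\n"
    "this is description\n"
    "this is title\n"
    "this is description\n"
)

# Step one: wrap every pair of lines in <li> ... </li>.
step_one = re.sub(r"((.*\n){1,2})", r"<li>\n\1</li>\n", raw)

# Step two: turn the two lines after each <li> into title/description spans,
# indented by two spaces.
step_two = re.sub(
    r"(<li>\n)(.*)\n(.*)",
    r'\1  <span class="title">\2</span>\n  <span class="description">\3</span>',
    step_one,
)

print(step_two)
```

The printed text should match the step two result shown above; wrapping it in <ul> is still left to you.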

Resources