ImportXML on Google Sheets - how to get a user variable (kind of)? - xpath

After managing to import the filmography for any actor on rateyourmusic.com via
=importxml("https://rateyourmusic.com/films/cary_grant/","//li")
I couldn't figure out how to retrieve my own user rating for certain titles (which would also tell me which titles in the list I've already seen).
As I'm still learning the ropes of the importxml command, all I've found out is that the ratings sit under an id such as 'film_cat_catalog_msg_1050'. Fiddling with the command, all I could get in a separate column of my spreadsheet so far was the standard word 'rate', but no personal rating.
Could anyone help me with that, please?
<li><span onclick="RYMartistPage.openFilmCataloger(1050);" class="disco_cat_inner"><span class="disco_cat_catalog_msg"><i class="fa fa-caret-left"></i> </span> <span id="film_cat_catalog_msg_1050">4.5</span></span><div id="film_cataloger_1050" class="film_cataloger"><div class="film_cataloger_close" onclick="RYMartistPage.collapseFilmCataloger(1050);"><i class="fa fa-caret-right"></i> </div> <div id="film_cataloger_content_1050" class="film_cataloger_content"></div></div>
<div class="has_tip film_rel_img delayed_discography_img" data-delayloadurl="url('//e.snmc.io/lk/m/l/45956edc922ce07e2b84a6ff23da3452/6152891.jpg')" data-delayloadurl2x="url('//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg')" style="background: rgba(0, 0, 0, 0) url("//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg") repeat scroll 0% 0%;"></div><div class="disco_avg_rating">3.81</div><div class="disco_ratings">1,063</div><div class="disco_reviews">25</div> <div class="film_info">
<div class="film_mainline recommended">
<a title="[Film1050]" href="/film/his_girl_friday/" class="film">His Girl Friday</a>
</div>
<div class="film_subline">
<span title="18 January 1940 " class="disco_year_ymd">1940</span> • Walter Burns
</div>
</div></li>
As you have to be logged in in order to see said ratings, here's a screenshot for those who aren't members:
[Screenshot: rateyourmusic.com filmography]

Try it with this XPath query:
//span[@id="film_cat_catalog_msg_1050"]
As you have already guessed, we need something like starts-with, since the numeric part is actually variable:
//span[starts-with(@id, "film_cat_catalog_msg_")]
And putting it all together:
=importxml("https://rateyourmusic.com/films/cary_grant/","//span[starts-with(@id, 'film_cat_catalog_msg_')]")
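To see what the starts-with predicate selects, here is a minimal Python sketch using only the standard library. It is an illustration, not the answerer's method: xml.etree.ElementTree does not support starts-with(), so the prefix test is done with str.startswith instead. The markup is a trimmed copy of the snippet from the question, and the second list item (id 2001, "Hypothetical Title") is invented to show an unrated film.

```python
import xml.etree.ElementTree as ET

# Trimmed from the question; the second <li> is a made-up, unrated example.
HTML = """
<ul>
  <li><span id="film_cat_catalog_msg_1050">4.5</span>
      <a title="[Film1050]" href="/film/his_girl_friday/" class="film">His Girl Friday</a></li>
  <li><span id="film_cat_catalog_msg_2001">rate</span>
      <a title="[Film2001]" href="/film/hypothetical_title/" class="film">Hypothetical Title</a></li>
</ul>
"""

PREFIX = "film_cat_catalog_msg_"


def ratings_by_title(markup):
    """Map each film title to the text of its film_cat_catalog_msg_* span."""
    root = ET.fromstring(markup)
    ratings = {}
    for li in root.iter("li"):
        # Stand-in for //span[starts-with(@id, 'film_cat_catalog_msg_')]
        span = next(
            (s for s in li.iter("span") if s.get("id", "").startswith(PREFIX)),
            None,
        )
        link = li.find(".//a[@class='film']")
        if span is not None and link is not None:
            ratings[link.text] = span.text
    return ratings


if __name__ == "__main__":
    print(ratings_by_title(HTML))
```

Pairing each span with the title in the same list item keeps ratings and films aligned even when some entries only show the default 'rate' text.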

Related

How to get specific xpath tag value

<div class="container">
<span class="price">
<bdi> 140 </bdi>
</span>
<span class="price">
<del>
<bdi>90</bdi>
</del>
<ins>
<bdi> 120 </bdi>
</ins>
</span>
</div>
I want to scrape a site whose HTML is formatted like the snippet above. I don't want the bdi value that is under the del tag; I want the bdi values that sit directly under span class="price" and under the ins tag. Is there an XPath that does this?
Doesn't the usual //span/ins/bdi/text() work for you?
It reads as "the text of a <bdi> whose parent is an <ins> whose parent is a <span>".
The CSS variant span>ins>bdi::text should also work, I suppose.
Sorry, I hadn't noticed that you need two values. In that case, .xpath('//bdi[not(parent::del)]/text()').extract() will work well.
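As a rough check of what //bdi[not(parent::del)] keeps, the sketch below applies the same rule with Python's standard library. ElementTree has no parent axis, so a child-to-parent map stands in for parent::del; that substitution is an assumption of this sketch, and the markup is the question's snippet compressed onto fewer lines.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="container">
<span class="price"><bdi> 140 </bdi></span>
<span class="price"><del><bdi>90</bdi></del><ins><bdi> 120 </bdi></ins></span>
</div>"""


def bdi_not_in_del(markup):
    """Return the text of every <bdi> whose parent is not <del>."""
    root = ET.fromstring(markup)
    # ElementTree elements do not know their parent, so build a lookup table.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [
        bdi.text.strip()
        for bdi in root.iter("bdi")
        if parent_of[bdi].tag != "del"
    ]


if __name__ == "__main__":
    print(bdi_not_in_del(HTML))
```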

XPath query for Google importXML function

I'm trying to write an xpath query to import some content from a webpage in google spreadsheets using importXML function. I need to capture % of Buy under COMMUNITY SENTIMENTS on the below webpage:
https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI
This % was showing 73% at time of posting this message, but may change later. (So I need to import 73% in my Google sheets).
The relevant HTML of this page is below:
</script>
<ul class="buy_sellper">
<li><span result="73" class="bullet_clr buy buy_results"></span>73% BUY</a></li>
<li><span result="20" class="bullet_clr sell sell_results"></span>20% SELL</a></li>
<li><span result="7" class="bullet_clr hold hold_results"></span>7% HOLD</a></li>
</ul>
</div>
</div>
<div class="chart_fr ">
<div class="txt_pernbd">73%</div>
<div class="cht_mt25">of moneycontrol users recommend <span class=green_txt>buying</span> SBI</div>
</div>
<!-- buy, sell, hold starts -->
<div class="buy-sell-hold">
<p>What's your call on SBI today?</p>
<p>
Using Chrome, I used the "inspect element" function and then "copy xpath", which gave me the following:
//*[@id="MshareElement"]
But this does not return any results when I use it in Google Sheets with the importxml function. I have zero knowledge of programming and I am trying to learn web scraping techniques using this function.
Please help.
try:
=REGEXEXTRACT(QUERY(IMPORTDATA(
"https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI"),
"where lower(Col1) contains 'txt_pernbd'"), ">(.+?)<")
=REGEXEXTRACT(QUERY(IMPORTDATA(
"https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI"),
"where lower(Col1) contains 'bullet_clr sell sell_results'"), "/span>(.+?)</a")
=REGEXEXTRACT(QUERY(IMPORTDATA(
"https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI"),
"where lower(Col1) contains 'investor views'"), ">(.+?)<")

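The REGEXEXTRACT formulas above filter the imported rows for a marker string and then pull out the first capture group. A small Python sketch of the same idea follows. It assumes each row is one line of the HTML quoted in the question; the real IMPORTDATA output may be split differently, and the helper name regex_extract is made up for this example.

```python
import re

# Three lines copied from the HTML quoted in the question.
PAGE_LINES = [
    '<li><span result="73" class="bullet_clr buy buy_results"></span>73% BUY</a></li>',
    '<li><span result="20" class="bullet_clr sell sell_results"></span>20% SELL</a></li>',
    '<div class="txt_pernbd">73%</div>',
]


def regex_extract(lines, needle, pattern):
    """Find the first line containing needle (case-insensitive) and
    return the first capture group of pattern, or None if nothing matches."""
    for line in lines:
        if needle in line.lower():
            match = re.search(pattern, line)
            if match:
                return match.group(1)
    return None


if __name__ == "__main__":
    print(regex_extract(PAGE_LINES, "txt_pernbd", r">(.+?)<"))
    print(regex_extract(PAGE_LINES, "bullet_clr sell sell_results", r"/span>(.+?)</a"))
```
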
XPATH Matching Sale Price or Regular Price

I need to match either the sale price (if on sale) or the regular price using one expression(hope that's the right term). Here's the two example HTML structures:
On Sale
<span class="price">
<del>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
14.99
</span>
</del>
<ins>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
12.99
</span>
</ins>
Regular Price
<p class="price">
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">$</span>
25.00
</span>
</p>
The expression I have so far is:
//*[@class="woocommerce-Price-amount amount"][last()]
It matches on both scenarios but returns both regular and sale prices for the "On Sale" scenario. Do I need some conditional to only return the sale price?
I thought I could possibly return only the last [@class="woocommerce-Price-amount amount"]. I tried last-child but didn't fully understand it.
First, note that your "on sale" snippet is missing a closing </span>; once that is fixed, this convoluted expression should do the trick.
It's formatted for easier reading:
//span[@class="woocommerce-Price-amount amount"]
[parent::p
or
parent::ins
]
Try it on your actual code and see if it works.
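For a quick offline check of the "parent is p or ins" rule, here is a Python sketch using the standard library. Because ElementTree cannot express parent::p or parent::ins, it walks the tree and looks at the direct span children of every p and ins element instead. The markup combines the two snippets from the question, with the missing closing </span> added and a wrapper div assumed so that it parses as one document.

```python
import xml.etree.ElementTree as ET

HTML = """
<div>
  <span class="price">
    <del>
      <span class="woocommerce-Price-amount amount">
        <span class="woocommerce-Price-currencySymbol">$</span>
        14.99
      </span>
    </del>
    <ins>
      <span class="woocommerce-Price-amount amount">
        <span class="woocommerce-Price-currencySymbol">$</span>
        12.99
      </span>
    </ins>
  </span>
  <p class="price">
    <span class="woocommerce-Price-amount amount">
      <span class="woocommerce-Price-currencySymbol">$</span>
      25.00
    </span>
  </p>
</div>
"""

AMOUNT_CLASS = "woocommerce-Price-amount amount"


def current_prices(markup):
    """Return the amount spans whose parent is <p> or <ins>, as compact strings."""
    root = ET.fromstring(markup)
    prices = []
    for parent in root.iter():
        if parent.tag not in ("p", "ins"):
            continue
        for span in parent.findall("span"):  # direct children only
            if span.get("class") == AMOUNT_CLASS:
                # Join the currency symbol and the number, dropping whitespace.
                prices.append("".join("".join(span.itertext()).split()))
    return prices


if __name__ == "__main__":
    print(current_prices(HTML))
```

The struck-through price under del is skipped because its parent is neither p nor ins.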

Make XPath stop at a certain depth?

I have the following HTML
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
</span>
I want an XPath that gets only the text 09:00 and not "Some more text", without using text()[1], because that causes other problems. My current XPath looks like this:
("//span[1][contains(@class, 'day-time-clock')]/text()")
I want one that ignores this whole part of the HTML
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some more text
</div>
</div>
You can limit the level of descendant:: nodes with position().
So the following expression does work:
span/descendant::node()[2 > position()]
Adjust the number in the predicate to your needs, 2 is only an example. A disadvantage of this approach is that the counting of the descendants is only accurate for the first child in the descending tree.
Another approach is to limit both the ancestors and the descendants:
span/descendant::node()[3 > count(ancestor::*) and 1 > count(descendant::*)]
Here, too, you have to adjust the numbers in the predicates to get any useful results.
Use normalize-space() to select only the text nodes that are not pure whitespace:
//span[contains(@class, 'day-time-clock')]/text()[normalize-space()]
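The effect of /text()[normalize-space()] can be sketched in Python with the standard library. In ElementTree an element's direct text nodes are its own .text plus the .tail of each child, so the sketch filters those for non-whitespace content; that mapping is this sketch's stand-in for the XPath, and the wrapper div is an assumption added to the question's markup.

```python
import xml.etree.ElementTree as ET

HTML = """
<div>
  <span class="medium bold day-time-clock">
    09:00
    <div class="tooltip-box first-free-tip ">
      <div class="tooltip-box-inner">
        <span class="fa fa-clock-o"></span>
        Some more text
      </div>
    </div>
  </span>
</div>
"""


def direct_text(element):
    """Return the element's own non-whitespace text nodes, ignoring descendants."""
    parts = [element.text] + [child.tail for child in element]
    return [part.strip() for part in parts if part and part.strip()]


def clock_times(markup):
    root = ET.fromstring(markup)
    times = []
    for span in root.iter("span"):
        # Substring test, mirroring contains(@class, 'day-time-clock')
        if "day-time-clock" in span.get("class", ""):
            times.extend(direct_text(span))
    return times


if __name__ == "__main__":
    print(clock_times(HTML))
```

The tooltip text is never visited because it belongs to a nested div, not to the span itself.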
I think (if I understand you correctly) that
"..//div[contains(@class, 'tooltip-box')]/parent::span"
gets you there.

Getting two lists using Xpath that are both contained in the same container

In the code sample below I'm looking to extract, using Xpath inside of Scrapy, first from list 1 and then from list 2. Some items may be linked out while others are just items in the list. What I need is two strings (or lists) one for List 1 and one for List 2
<div class="row">
<div class="col-xs-12 no-padding-xs">
<h3 class="text-primary gutter-xs">List 1</h3>
<div class="well well-sm">
Miniature, Mustang, Paint Pony, Pinto, Pony, POA, Quarter Pony, Shetland Pony, Spanish Mustang
</div>
<h3 class="text-primary gutter-xs">List 2</h3>
<div class="well well-sm">
All Around, Driving, Halter, Lesson, Natural Horsemanship, Show, Trail Riding, Western Pleasure, Western Riding, Youth, Champion Trainer, POA Ponies for Sale, Newaygo County, Horse Boarding, Equestrian Coaching, Michigan, Riding Lessons, Horse Lea
</div>
</div>
</div>
Not sure that I understood you properly, but you can try:
from w3lib.html import remove_tags

for list_text in ['List 1', 'List 2']:
    # Select the first <div> that follows the matching <h3> heading
    div_data = response.xpath('//h3[text()="{}"]/following-sibling::div[1]'.format(list_text)).get()
    if not div_data:
        continue
    # Strip the markup from each comma-separated item
    print([remove_tags(i).strip() for i in div_data.split(',')])
Or if you want just strings:
from w3lib.html import remove_tags

for list_text in ['List 1', 'List 2']:
    div_data = response.xpath('//h3[text()="{}"]/following-sibling::div[1]'.format(list_text)).get()
    if not div_data:
        continue
    # Print the whole list as one string with the markup removed
    print(remove_tags(div_data))
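Outside of a Scrapy response, the same heading-then-next-div pairing can be tried with the standard library alone. The sketch below is an approximation of following-sibling::div[1]: it scans each element's children and takes the first div after every h3. The lists are shortened from the question, and the link around "Paint Pony" is an invented example of an item that is "linked out".

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <a> element is a made-up linked item.
HTML = """
<div class="row">
  <div class="col-xs-12 no-padding-xs">
    <h3 class="text-primary gutter-xs">List 1</h3>
    <div class="well well-sm">
      Miniature, Mustang, <a href="#">Paint Pony</a>, Pinto
    </div>
    <h3 class="text-primary gutter-xs">List 2</h3>
    <div class="well well-sm">
      All Around, Driving, Halter
    </div>
  </div>
</div>
"""


def lists_by_heading(markup):
    """Map each <h3> heading to the items of the first <div> that follows it."""
    root = ET.fromstring(markup)
    result = {}
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag != "h3":
                continue
            # First following sibling that is a <div>, if any
            following = next(
                (c for c in children[index + 1:] if c.tag == "div"), None
            )
            if following is None:
                continue
            text = "".join(following.itertext())  # also flattens linked items
            result[child.text.strip()] = [
                item.strip() for item in text.split(",") if item.strip()
            ]
    return result


if __name__ == "__main__":
    print(lists_by_heading(HTML))
```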

Resources