XPath query for Google importXML function

I'm trying to write an XPath query to import some content from a webpage into Google Sheets using the IMPORTXML function. I need to capture the % of Buy under COMMUNITY SENTIMENTS on the webpage below:
https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI
This % was showing 73% at the time of posting this message, but may change later. (So I need to import 73% into my Google Sheet.)
The relevant HTML of this page is:
</script>
<ul class="buy_sellper">
<li><span result="73" class="bullet_clr buy buy_results"></span>73% BUY</a></li>
<li><span result="20" class="bullet_clr sell sell_results"></span>20% SELL</a></li>
<li><span result="7" class="bullet_clr hold hold_results"></span>7% HOLD</a></li>
</ul>
</div>
</div>
<div class="chart_fr ">
<div class="txt_pernbd">73%</div>
<div class="cht_mt25">of moneycontrol users recommend <span class=green_txt>buying</span> SBI</div>
</div>
<!-- buy, sell, hold starts -->
<div class="buy-sell-hold">
<p>What's your call on SBI today?</p>
<p>
Using Chrome, I used the "inspect element" function and then "copy xpath", which gave me the following:
//*[@id="MshareElement"]
But this does not return any results when I use it in Google Sheets with the IMPORTXML function. I have zero knowledge of programming and I am trying to learn web scraping techniques using this function.
Please help.

try:
=REGEXEXTRACT(QUERY(IMPORTDATA(
"https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI"),
"where lower(Col1) contains 'txt_pernbd'"), ">(.+?)<")
=REGEXEXTRACT(QUERY(IMPORTDATA(
"https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI"),
"where lower(Col1) contains 'bullet_clr sell sell_results'"), "/span>(.+?)</a")
=REGEXEXTRACT(QUERY(IMPORTDATA(
"https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI"),
"where lower(Col1) contains 'investor views'"), ">(.+?)<")
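To check the regular expressions outside of Sheets, here is a minimal Python sketch that applies the first two patterns to lines quoted in the question. The `html_lines` list is copied from the question rather than fetched live, and the `extract` helper is a made-up name; matching line by line only approximates how IMPORTDATA splits the page into rows, so this illustrates the patterns, not the live page.

```python
import re

# Lines copied from the HTML quoted in the question (not fetched live).
html_lines = [
    '<li><span result="73" class="bullet_clr buy buy_results"></span>73% BUY</a></li>',
    '<li><span result="20" class="bullet_clr sell sell_results"></span>20% SELL</a></li>',
    '<div class="txt_pernbd">73%</div>',
]


def extract(lines, needle, pattern):
    """Return the first capture group from the first line containing needle."""
    for line in lines:
        # Mirrors "where lower(Col1) contains '...'" in the QUERY call.
        if needle in line.lower():
            match = re.search(pattern, line)
            if match:
                return match.group(1)
    return None


buy_percent = extract(html_lines, "txt_pernbd", r">(.+?)<")
sell_text = extract(html_lines, "bullet_clr sell sell_results", r"/span>(.+?)</a")
print(buy_percent)
print(sell_text)
```

The lazy group `(.+?)` stops at the first `<` after the opening tag, which is why only the text between the tags is captured.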

Related

XPath valid in Firefox but not in Chrome

I am trying to find a menu element via XPath in the JupyterLab UI. The following is an extract of the list of elements in the menu I am interested in, and should be a good minimal example of my problem:
<li tabindex="0" aria-disabled="true" role="menuitem" class="lm-Menu-item p-Menu-item lm-mod-disabled p-mod-disabled lm-mod-hidden p-mod-hidden" data-type="command" data-command="filemenu:logout">
<div class="f1vya9e0 lm-Menu-itemIcon p-Menu-itemIcon jp-Icon"></div>
<div class="lm-Menu-itemLabel p-Menu-itemLabel">Log Out</div>
<div class="lm-Menu-itemShortcut p-Menu-itemShortcut"></div>
<div class="lm-Menu-itemSubmenuIcon p-Menu-itemSubmenuIcon"></div>
</li>
<li tabindex="0" role="menuitem" class="lm-Menu-item p-Menu-item" data-type="command" data-command="hub:logout"><div class="f1vya9e0 lm-Menu-itemIcon p-Menu-itemIcon jp-Icon">
<div class="f1vya9e0 lm-Menu-itemIcon p-Menu-itemIcon jp-Icon"></div>
<div class="lm-Menu-itemLabel p-Menu-itemLabel">Log Out</div>
<div class="lm-Menu-itemShortcut p-Menu-itemShortcut"></div>
<div class="lm-Menu-itemSubmenuIcon p-Menu-itemSubmenuIcon"></div>
</li>
As you can see, both <li> items contain a <div> with the text Log Out, which is my main problem, as I am trying to write a general XPath expression that can work for any menu item. What I am currently trying to use is:
//div[contains(@class, 'p-Menu-itemLabel')][text() = '${item}']
Where ${item} can be any menu item, as all <li> items will have a similar div with text in them. The problem arises with the Log Out item, which is the only one that appears twice. In order to handle this special case, I have thought of using
//div[contains(@class, 'p-Menu-itemLabel')][text() = 'Log Out']/..[not(contains(@class,'p-mod-hidden'))]
Since either one of the two <li> items will not contain that specific class (i.e., the currently active Log Out element).
This XPath works fine in Firefox and finds the element I am looking for every time; however, Chrome complains that it is not a valid XPath expression. Somehow this reduced version:
//div[contains(@class, 'p-Menu-itemLabel')][text() = 'Log Out']/..
works in Chrome, but any time I try to use an attribute selector on the parent element (i.e. /..[something]) it fails to recognize it as a valid XPath.
Does anyone have any idea of why? And what can I do to make Chrome recognize it as a valid XPath?
Chrome does not accept a predicate applied directly to the abbreviated parent step (..). In the XPath 1.0 grammar the abbreviated steps . and .. cannot take predicates, so Chrome is being strict here and Firefox lenient.
You can use the long form instead: parent::*
//div[contains(@class, 'p-Menu-itemLabel')][text() = 'Log Out']/parent::*[not(contains(@class,'p-mod-hidden'))]
Or step to self::* and apply the predicate there:
//div[contains(@class, 'p-Menu-itemLabel')][text() = 'Log Out']/../self::*[not(contains(@class,'p-mod-hidden'))]
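To make the selection logic concrete, here is a rough Python sketch of what the parent::* expression picks out. It uses the standard library's ElementTree, whose XPath subset has neither contains() nor a parent axis, so both steps are emulated in plain Python; the markup is a trimmed copy of the question's menu and `visible_items` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the menu markup from the question, wrapped in a <ul>.
MENU = """
<ul>
  <li class="lm-Menu-item p-Menu-item lm-mod-hidden p-mod-hidden" data-command="filemenu:logout">
    <div class="lm-Menu-itemLabel p-Menu-itemLabel">Log Out</div>
  </li>
  <li class="lm-Menu-item p-Menu-item" data-command="hub:logout">
    <div class="lm-Menu-itemLabel p-Menu-itemLabel">Log Out</div>
  </li>
</ul>
"""


def visible_items(root, label):
    """Parents of matching label divs whose class lacks 'p-mod-hidden'."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = []
    for div in root.iter("div"):
        if "p-Menu-itemLabel" in div.get("class", "") and div.text == label:
            parent = parent_of[div]  # the parent::* step
            if "p-mod-hidden" not in parent.get("class", ""):  # the predicate
                found.append(parent)
    return found


items = visible_items(ET.fromstring(MENU), "Log Out")
print([li.get("data-command") for li in items])
```

Only the `hub:logout` item survives the filter, because the other `<li>` carries the `p-mod-hidden` class.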

Comparing Selenium XPath values

I am trying to branch on XPath results by comparing them.
I want to check whether a post has comments or not.
<div class="media-body">
<a href="https://url" class="ellipsis">
<span class="pull-right count orangered">
+26 </span>
post title </a>
<div class="media-info ellipsis">
admin <i class="fa fa-clock-o"></i> date </div>
</div>
If there is a comment, span class="pull-right count orangered" is generated. If there is no comment, it is not produced.
xpath comment //*[@id="thema_wrapper"]/div[3]/div/div/div[3]/div/div[7]/div[2]/div[1]/div[2]/a/span
xpath nocomment //*[@id="thema_wrapper"]/div[3]/div/div/div[3]/div/div[7]/div[2]/div[1]/div[2]/a/
I think we can compare this with if/else, but I don't know how.
if
#nocomment start
else
#comment stop
I searched a lot for the data, but I couldn't find it. Please help me.
Here's an XPath example to select/click on something without comment. This website seems to use the same system as your sample data :
http://cineaste.co.kr/
To select the entries with no comment for the movies block ("영화이야기"), just use :
//h3[.="영화이야기"]/following::div[@class="widget-small-box"][1]//li[@class="ellipsis"][not(contains(.,"+"))]
We verify the presence of the "+" in the li node to filter the data.
Oh, it's the same system. I tested it and there was an error.
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//h3[.='영화이야기']/following::div[@class='widget-small-box'][1]//li[@class='ellipsis'][not(contains(.,'+'))]"}
(Session info: chrome=81.0.4044.138)
from selenium import webdriver
import time
path = r"C:\chromed\chromedriver.exe"  # raw string so the backslashes are kept literally
driver = webdriver.Chrome(path) #path
'''
'''
driver.get("http://cineaste.co.kr/") #url
time.sleep(0.5)
postclick = driver.find_element_by_xpath("//h3[.='영화이야기']/following::div[@class='widget-small-box'][1]//li[@class='ellipsis'][not(contains(.,'+'))]") # open the login window
postclick.click()
driver.close()
Could you make an example with the site? I want to ignore the posts with comments and just click the ones without comments.
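The if/else decision itself can be sketched offline, without a browser. The example below is only an illustration under assumptions: the two posts are modelled on the markup in the question (the titles are made up), it uses the standard library's ElementTree instead of Selenium, and `has_comments` is a hypothetical helper. In Selenium the same branch is commonly written against find_elements, which returns an empty list when nothing matches.

```python
import xml.etree.ElementTree as ET

# Two posts modelled on the markup in the question: the first has a
# comment-count badge, the second does not. Titles are made up.
PAGE = """
<div>
  <div class="media-body">
    <a href="https://url" class="ellipsis">
      <span class="pull-right count orangered">+26 </span>
      post with comments </a>
  </div>
  <div class="media-body">
    <a href="https://url" class="ellipsis">
      post without comments </a>
  </div>
</div>
"""


def has_comments(anchor):
    """True if the link contains a span whose class list includes 'count'."""
    return any(
        "count" in span.get("class", "").split() for span in anchor.iter("span")
    )


root = ET.fromstring(PAGE)
to_click = []
for anchor in root.iter("a"):
    if not has_comments(anchor):
        # No badge: this is a post we would click. Collapse whitespace in the title.
        to_click.append(" ".join("".join(anchor.itertext()).split()))

print(to_click)
```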

ImportXML on Google Sheets - how to get a user variable (kind of)?

After managing to import the filmography for any actor on rateyourmusic.com via
=importxml("https://rateyourmusic.com/films/cary_grant/","//li")
I couldn't figure out how to retrieve my own user rating for certain titles (which would also tell me which title in the list I've already seen).
As I'm still learning the ropes of the importxml command, all I found out is that they're under the 'film_cat_catalog_msg_1050' XPath identifier(?). Fiddling with the command, all I could get in a separate column of my spreadsheet so far was the standard 'rate' word, but no personal rating.
Could anyone help me with that, please?
<li><span onclick="RYMartistPage.openFilmCataloger(1050);" class="disco_cat_inner"><span class="disco_cat_catalog_msg"><i class="fa fa-caret-left"></i> </span> <span id="film_cat_catalog_msg_1050">4.5</span></span><div id="film_cataloger_1050" class="film_cataloger"><div class="film_cataloger_close" onclick="RYMartistPage.collapseFilmCataloger(1050);"><i class="fa fa-caret-right"></i> </div> <div id="film_cataloger_content_1050" class="film_cataloger_content"></div></div>
<div class="has_tip film_rel_img delayed_discography_img" data-delayloadurl="url('//e.snmc.io/lk/m/l/45956edc922ce07e2b84a6ff23da3452/6152891.jpg')" data-delayloadurl2x="url('//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg')" style="background: rgba(0, 0, 0, 0) url("//e.snmc.io/lk/t/l/48b945e1a503ab7a9dce538a50fa9b99/6152891.jpg") repeat scroll 0% 0%;"></div><div class="disco_avg_rating">3.81</div><div class="disco_ratings">1,063</div><div class="disco_reviews">25</div> <div class="film_info">
<div class="film_mainline recommended">
<a title="[Film1050]" href="/film/his_girl_friday/" class="film">His Girl Friday</a>
</div>
<div class="film_subline">
<span title="18 January 1940 " class="disco_year_ymd">1940</span> • Walter Burns
</div>
</div></li>
As you have to be logged in in order to see said ratings, here's a screenshot for those who aren't members:
rateyourmusic.com filmography
Try it with this XPath query:
//span[@id="film_cat_catalog_msg_1050"]
As you have already guessed, we need something like starts-with since the numeric part is actually variable:
//span[starts-with(@id, "film_cat_catalog_msg_")]
And putting it all together:
=importxml("https://rateyourmusic.com/films/cary_grant/","//span[starts-with(@id, 'film_cat_catalog_msg_')]")
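For comparison, here is a small Python sketch of the same prefix match on a reduced fragment. It is an illustration only: the first item follows the question's HTML, the second item (id 2000, "Hypothetical Title") is invented, and since ElementTree's XPath subset has no starts-with(), the check is done with str.startswith.

```python
import xml.etree.ElementTree as ET

# Reduced list items: the first follows the question, the second is invented.
FRAGMENT = """
<ul>
  <li><span id="film_cat_catalog_msg_1050">4.5</span>
      <a class="film">His Girl Friday</a></li>
  <li><span id="film_cat_catalog_msg_2000">rate</span>
      <a class="film">Hypothetical Title</a></li>
</ul>
"""

PREFIX = "film_cat_catalog_msg_"

root = ET.fromstring(FRAGMENT)
ratings = {}
for li in root.iter("li"):
    title = li.find("a").text
    for span in li.iter("span"):
        # Python counterpart of starts-with(@id, 'film_cat_catalog_msg_')
        if span.get("id", "").startswith(PREFIX):
            ratings[title] = span.text

print(ratings)
```

An unrated title still matches the prefix and yields the placeholder word, so the caller has to tell "rate" apart from a numeric rating.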

XPath Exclude Text From Child Element

I'm looking to get the output:
50ml milk
From the following code:
<ul class="ingredients-list__group">
<li>50ml <a href="/glossary/milk" class="tooltip-processed">milk
<div class="tooltip">
<h2
class="node-title">Milk</h2> <span class="fonetic">mill-k</span>
<p>One of the most widely used ingredients, milk is often referred to as a complete food. While cow…</p>
</div>
</a>
</li>
</ul>
Currently I'm using the XPATH:
//ul[@class="ingredients-list__group"]/li
But getting:
50ml milk Milk mill-kOne of the most widely used ingredients, milk is often referred to as a complete food. While cow…
How do I exclude the stuff within the div/tooltip?
With XPath 2.0:
//ul[@class="ingredients-list__group"]/li/concat(./text()[1], ./a/text()[1])
With XPath 1.0:
concat(//ul[@class="ingredients-list__group"]/li/text()[1], //ul[@class="ingredients-list__group"]/li/a/text()[1])
You can select the relevant text nodes using
//ul[@class="ingredients-list__group"]//text()[not(ancestor::div[@class='tooltip'])]
If you're in XPath 2.0 you can then put this in a call of string-join() to join these into a single string. If you're stuck with 1.0, you'll have to return multiple text nodes to the calling application and concatenate them together in the host language code.
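As a sketch of the "concatenate in the host language" route, the Python below walks the question's markup with the standard library's ElementTree, skips the tooltip div, and joins what is left. ElementTree exposes text nodes as `.text` and `.tail` rather than through text(), so the traversal is written by hand; `text_outside_tooltip` is a hypothetical helper and the tooltip paragraph is trimmed from the original.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with the tooltip paragraph trimmed.
MARKUP = """
<ul class="ingredients-list__group">
<li>50ml <a href="/glossary/milk" class="tooltip-processed">milk
<div class="tooltip">
<h2 class="node-title">Milk</h2> <span class="fonetic">mill-k</span>
<p>One of the most widely used ingredients, milk is often referred to as a complete food.</p>
</div>
</a>
</li>
</ul>
"""


def text_outside_tooltip(element):
    """Yield text nodes under element, skipping any div.tooltip subtree."""
    if element.text:
        yield element.text
    for child in element:
        is_tooltip = child.tag == "div" and child.get("class") == "tooltip"
        if not is_tooltip:
            yield from text_outside_tooltip(child)
        # The tail follows the child's closing tag, so it belongs to `element`.
        if child.tail:
            yield child.tail


li = ET.fromstring(MARKUP).find("li")
# Join the pieces, then collapse runs of whitespace and newlines.
result = " ".join("".join(text_outside_tooltip(li)).split())
print(result)
```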

XPath expression to select text not in paragraph

I'm developing web scraping software that relies on XPath to extract information from web pages.
One application of the software is to scrape reviews of shows from websites. One page I'm trying to scrape is the Guardian's latest Edinburgh festival reviews: http://www.guardian.co.uk/culture/edinburghfestival+tone/reviews
The section I want is at the bottom, titled "Most recent". The XPath expression for the list of review items (that is the pic, the stars, the date, the blurb, etc) is
//ul[@id='auto-trail-block']
which returns a list of li elements, each corresponding to one review item.
If I want to refer to only the blurb, the closest I can get is to say
//ul[@id='auto-trail-block']/div[@class='trailtext']
but when I collect the text content from each item of the list, it includes lots of Javascript and nasty stuff I don't need. I can't refer to the blurb itself because it is not inside a p element, but within a div element that contains script elements and strong elements that contain javascript and unrelated text respectively.
In the debugger, the DOM looks like this:
<ul id="auto-trail-block" ...>
<li ...>
<div ...>
<div ...>
<div ...>
<div class="trailtext">
<script ...>
<div ...>
<span ...>
<strong .../>
<br/>
The Text I want to copy!
<strong .../>
<a .../>
<div .../>
</div>
</div>
</li>
<li ...>
...
</li>
...
</ul>
Is there any way to refer to the text content contained in just the div and not any of its subelements?
My approach would be to select the trailtext div, remove the script tags with their content and all HTML tags. What's left would be the content you want.
Just wondering - what does the inner text node of //ul[@id='auto-trail-block']/div[@class='trailtext'] return? I would guess mostly the blurb, so clearing out the script tags should almost get you there.
If you only want the text node children of div[@class='trailtext'], then use text()
//ul[@id='auto-trail-block']//div[@class='trailtext']/text()
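To show what "only the text node children" means in practice, here is a minimal Python sketch on an invented sample shaped like the DOM in the question. It uses the standard library's ElementTree, which can locate the div with a simple attribute predicate but has no text() step, so the div's own text nodes are gathered from `.text` and each child's `.tail`; `own_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample shaped like the DOM in the question.
SAMPLE = """
<ul id="auto-trail-block">
  <li>
    <div class="trailtext">
      <script>var x = 1;</script>
      <strong>Label</strong>
      <br/>
      The Text I want to copy!
      <strong>More</strong>
      <a href="#">link</a>
    </div>
  </li>
</ul>
"""


def own_text(element):
    """Non-blank text nodes that are direct children of element."""
    parts = [element.text or ""]
    # Text after a child's closing tag is stored as that child's tail.
    parts.extend(child.tail or "" for child in element)
    return [part.strip() for part in parts if part.strip()]


root = ET.fromstring(SAMPLE)
blurbs = [own_text(div) for div in root.findall(".//div[@class='trailtext']")]
print(blurbs)
```

The script, strong and a contents are left out because they live inside child elements, not directly under the div.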
