Google Sheets IMPORTXML XPath help (understanding how to read a page source) - xpath

I am trying to write a function that will give me the annual payout dividend for a given stock. The website I am using is www.seekingalpha.com
So I understand that the function is =IMPORTXML (URL, xpath_query).
In that case, my URL is: https://seekingalpha.com/symbol/VOO/dividends/scorecard
but the problem I am having is figuring out the correct XPath to acquire the dividend value.
I currently have this as my function:
=IMPORTXML(CONCATENATE("https://www.seekingalpha.com/symbol/", $B2, "/dividends/scorecard"), "//body")
$B2 is a cell that holds the ticker symbol if you are wondering. Anyways, I right-clicked the number I wanted from the website and followed it downstream and tried seeing where it is nested under but keep running into the wrong "directory" per se, because I am usually left with an error "Empty."
I have also tried copying the xPath directly:
/html/body/div[2]/div[1]/div/div[1]/div/div/div[2]/section/section[1]/table/tbody/tr/td[1]
but am greeted with another empty field error.
Could anyone point me in the right direction? I've been researching this for a while and figured this would be a great way to learn. Thank you in advance

you need some other source. Google Sheets does not support scraping of JavaScript elements. you can test JS dependency simply by disabling JS for given site and what's left can be scraped. in your case its nothing:
UPDATE:
=INDEX(IMPORTXML("https://stocknews.com/stock/"&A15&"/dividends/",
"//div[#class='grade-cat-ytd']"), 2)
understanding how to read a page source

Related

tool for extracting xpath query from speciifed/selected node

Normally, one would use an XPath query to obtain a certain value or node. In my case, I'm doing some web-scraping with google spreadsheets, using the importXML function to update automatically some values. Two examples are given below:
=importxml("http://www.creditagricoledtvm.com.br/";"(//td[#class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
The problem is that the pages I'm scraping will change every now and then and I understand very little about XML/XPath, so it takes a lot of trial and error to get to a node. I was wondering if there is any tool I could use to point to an element (either in the page or in its code) that would provide an appropriate query.
For example, in the second case, I've noticed the info I wanted was in a span node (hence (//span)), so I printed all of them in a spreadsheet and used the line count to find the [32] index. This takes long to load, so it's pretty inconvenient. Also, I don't even remember how I've figured the //td[#class='xl7825385'] query. Thus why I'm wondering if there is more practical method of pointing to page elements.
Some clues :
Learning XPath basics is still useful. W3Schools is a good starting point.
https://www.w3schools.com/xml/xpath_intro.asp
Otherwise, built-in dev tools of your browser can help you to generate absolute XPath. Select an element, right-click on it then >Copy>Copy XPath.
https://developers.google.com/web/tools/chrome-devtools/open
Browser extensions like Chropath can generate absolute or relative XPath for you.
https://autonomiq.io/chropath/

Getting a xPath from XML document

I am trying to get some values from an online XML document, but I cannot find the right xpath to navigate to those values. I want to import these values into a Google Spreadsheet document, which requires me to get the exact xpath.
The website is this one, and I am trying to get the information for "WillPay" information from MeetingInfo Venue=S1, Races RaceNo=1, Pools PoolInfo Pool=WIN, in OddsInfo.
For now, the value of "Number=1" should be 3350 (or something close to this, it changes quite often), and I would like to load all of these values onto the google spreadsheet document.
What I've tried is locating the xpath of all of it, and tried to my best attempt to get
"/AOSBS_XML/Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo/#WillPay"
but it doesn't work.
I've been stuck on this problem for months now and I've been avoiding it, but realised I can't anymore because it's hindering my work. Please help.
Thanks!
-Brandon
Try using this xpath expression:
//MeetingInfo[#Venue="S1"]/Races//RaceInfo[#RaceNo="1"]//Pools//PoolInfo[#Pool="WIN"]//OddsSet//OddsInfo[#Number="1"]/#WillPay
An alternative :
//OddsInfo[#WillPay][ancestor::PoolInfo[#Pool='WIN'] and ancestor::RaceInfo[#RaceNo='1'] and ancestor::MeetingInfo[#Venue='S1']]

Problem finding correct Selector for response.xpath or response.css in Scrapy on Coinmarketcap

i´d like to loop over the top20 exchanges on coinmarketcap to crawl the tables, e.g. https://coinmarketcap.com/exchanges/fatbtc/
Now i spent a few hours in finding the Selector, e.g. for Price
In Scrapy Shell i tried ... and many more, but all not working:
from Addon XPath Helper:
response.xpath('/html/body/div[#id='__next']/div[#class='cmc-app-wrapper.cmc-app-wrapper--env-prod.sc-1mezg3x-0.fUoBLh']/div[#class='container.cmc-main-section']/div[#class='cmc-main-section__content']/div[#class='cmc-exchanges.sc-1tluhf0-0.wNRWa']/div[#class='cmc-details-panel-table.sc-3klef5-0.cSzKTI']/div[#class='cmc-markets-listing.lgsxp9-0.eCrwnv']/div[#class='cmc-table.sc-1yv6u5n-0.dNLqEp']/div[#class='cmc-table__table-wrapper-outer']/div/table/tbody/tr[#class='cmc-table-row.sc-1ebpa92-0.kQmhAn'][1]/td[#class='cmc-table__cell.cmc-table__cell--sortable.cmc-table__cell--right.cmc-table__cell--sort-by__price']').getall()
from Chrome Inspector:
response.xpath('/td[#class='cmc-table__cell.cmc-table__cell--sortable.cmc-table__cell--right.cmc-table__cell--sort-by__price']').getall()
from Chrome Inspector copy XPath:
:
response.xpath('//*[#id="__next"]/div/div[2]/div[1]/div[2]/div[2]/div/div[2]/div[3]/div/table/tbody/tr[1]/td[5]').extract()
I´m using the Chrome Inspector and since today an addon called "Xpath helper" for showing the Selectors, but i still don´t really understand what i´m doing there :(. I´d really appreciate any idea how to access that data and to give me a better understanding in finding these selectors.
Pretty easy (I used position() to skip table header):
for row in response.xpath('//table[#id="exchange-markets"]//tr[position() > 1]'):
price = row.xpath('.//span[#class="price"]/text()').get()
# price = row.xpath('.//span[#class="price"]/#data-usd').get() #if you need to be more precise
XPATHs are basically //tagname[#attribute='value'] from HTML.
For your site, you can loop over names with //table[#id='exchange-markets']//tr/td[2]/a
and get prices with //table[#id='exchange-markets']//tr/td[5]
where we are basically saying to look within the table rows on column 5.

Google Sheets xpath query not working

Hoping someone smarter than me can help me sort this out! I've been stumped for a few days now trying to pull some data from website into Google Sheet using ImportXML with no luck.
I'm looking to import the average odds for various sporting events from the website Oddsportal.com which update and change throughout the day. I'd like my sheet to also update these odds, similar to stock prices.
For example:
http://www.oddsportal.com/search/San+Jose+Sharks/
I would like to pull the Average Odds for Team "1" (+136) into a cell, the odds for Tie "X"(+277) into a cell and Team "2"(+161) into individual cells. Just the odds portion. If it's unable to be pulled from that page it is also listed on http://www.oddsportal.com/hockey/usa/nhl/san-jose-sharks-nashville-predators-6cPaAHOM/ down at the bottom in the Average Odds Row.
This seems simple enough but I just can't seem to get the ImportXML query correct without an error.
I've looked at the page's source code (Ctrl-U). The original html does not contain needed values, they most likely loaded later thru xhr (ajax) call:
So most likely you'll not succeed with mere a request html.
You need to explore Network in the browser DevTools to find out what request is initiated (by JS files) to get needed data. This might be even unique one containing hash signiture, so you'll not reproduce it for future use.
I recommend you to turn to scriping tools for retrieving that info.

Google spreadsheet ImportXML Error:"the XPath query did not return any data"

I continue to get this error when I try to run this XPath query
//div[#iti='0']
on this link (flight search from google)
https://www.google.com/flights/#search;f=LGW;t=JFK;d=2014-05-22;r=2014-05-26
I get something like this:
=ImportXML("https://www.google.fr/flights/#search;f=jfk;t=lgw;d=2014-02-22;r=2014-02-26";"//div[#iti='0']")
I verified and the XPath is correct (I get the answer wanted using XPath helper, the answer wanted are the data relative to the first flight selected).
I guess that it is a problem of syntax, but I tried more or less all the combinations of lower/uppercase, punctuation (replacing ; , ' ") and I tried to link the URI and the XPath query stored in cells, but nothing works.
Any help will be appreciated.
As a matter of fact, maybe it is a bug on the new google sheets or they have changed how the function works. I've activated mine and when I try to use the ImportXML it simply wont work. Since I have some old sheets here (on the old mechanism) they still work normally. If I copy and paste the script from the old to the new one it simply doesn't get any data.
Here a example:
=ImportXML("http://www.nytimes.com/pages/todayspaper/index.html";"//div[#class='columnGroup first']//h3")
If I run this on the old mechanism it works fine, but if I run the same on the new mechanism, first it will exchange my ";" for a "," and then it will bring a "#N/A" with a warning of "Error: Imported XML content cannot be parsed".
Edit (05/05/2015):
I am happy to say that I tested this function again today on the new spreadsheets and they've fixed it. I was checking that every two months and now finally they have solved this issue. The example I've added above is now returning information.
I'm sorry, but you won't be able to easily parse Google result pages. The reason your function throws an error is because the content of the page you see in your browser is generated by javascript, and Google spreadsheet doesn't execute js.
Your ImportXML has the right syntax, it doesn't return anything because the node you're looking for isn't there (importXML Parse Error).
You will have to find another source if you want these result in your spreadsheet. For info some libraries already parse the usual result page (http://www.seerinteractive.com/blog/google-scraper-in-google-docs-update for example, if it still works), but I doubt finding one for your special case will be easy.
This gives the answer (importXML Parse Error), but it's not entirely obvious.
ImportXML doesn't load Javascript. When you're building ImportXML queries on Google results, make sure you're testing against a version of the page that has Javascript turned off. You can do this using the Chrome DevTools.
(But I agree that ImportXML is fickle, idiosyncratic, and generally rage-inducing).

Resources