Using Google Sheets to scrape horse racing odds - xpath

I'm new to google-sheets so have a lot to learn, I'm wanting to scrape latest horse odds from daily races on https://www.racenet.com.au
An example URL would be
https://www.racenet.com.au/racing-form-guide/ararat-20191210/all-races
I am having trouble getting importxml to pick up any useful data no matter what parameters I try, can anyone give me any suggestions on the correct syntax and parameters to get the horse name and odds from this site...

the first step is to block JavaScript on a given website to see what left to be scraped:
Google Sheets does not support importing of JS elements
then the next step would be running the IMPORTXML formula to see stuff that can be scraped
=IMPORTXML("URL_here", "//*")

Related

Incorrect URL/XPath when using Google sheets IMPORTXML

I'm trying to import a search result from google to my spreadsheet. I've had success with Wikipedia pages, but for some reason, Google search isn't working correctly (giving a "could not fetch url" error). I'm sure the problem is somewhere in my URL or XPath, but I've been trying a variety of things and I'm lost. Here is what I've got:
=IMPORTXML("https://www.google.com/search?q=dom+fera+easy+thing+released", "//div[#class='Z0LcW XcVN5d']")
I'm linking the spreadsheet below as view-only for reference as well. Ultimately the goal is to be able to webscrape release years of songs. I'd appreciate any help!
https://docs.google.com/spreadsheets/d/1bt8MJ23nfGAv6ianaR-sd7DM5DNn98p7zWSG1UzBlEY/edit?usp=sharing
AFAIK, you can't parse results from GoogleSearch in Google Sheets.
Using Discogs, MusicBrainz, All Music... to get the release dates could be useful.
But it seems some of your groups are little known. So, you can use Youtube to fetch the dates.
Note : we assume the year of publication on Youtube corresponds to the year of release.
Of course, that's not 100% true. For example, artists can clip their video months after release. Or publish nothing on Youtube.
So this method will work with a wide range of songs but not ALL the songs. With recent bands and songs, it should be OK.
To do this you can use the Youtube API or IMPORTXML formulas. In both cases, we always take the first result (relevant order) of the search engine as source.
You need an API key and an ImportJSON script (credits to Brad Jasper) to use the API method. Once you have installed the script and activated your API key,you can paste in cell B3:
="https://www.googleapis.com/youtube/v3/search?key={yourAPIKey}&part=snippet&type=video&filter=items&regionCode=FR&q="&ENCODEURL(A3)
We generate the url to query with the content you input in column A.
We use "regionCode=FR" since some songs are not available in the US ("i need you FMLYBND"). That way we get the correct release date.
In C3, you can paste :
=LEFT(QUERY(ImportJSON(B3);"SELECT Col11 LIMIT 1 label Col11''";1);4)
We parse the JSON, select the column of interest, the line of interest, then we clean the result.
With the IMPORTXML method, you can paste in E3 :
="https://www.youtube.com"&IMPORTXML("https://www.youtube.com/results?search_query="&A3;"(//div[#class='yt-lockup-thumbnail contains-addto'])[3]/a/#href")
We construct the url with the first search result of the search engine.
In F3, you can paste :
=LEFT(IMPORTXML(E3;"//meta[#itemprop='datePublished']/#content");4)
We parse the previously built url, then we extract the year of publication.
As you can see, there's a difference in the results on line 5. That's because the song is not available in the US. The first result returned in the IMPORTXML method is different from the one of the API method which uses a "FR" flag.
Side note : I'm based in Europe. So ";" in the formulas should be replaced with ",".
google does not support web scraping of google search into google sheets. this option was disabled 2 years ago. you will need to use alternative search engine

Extracting Financial Statement Data Using Google Sheets + IMPORTXML [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I am trying to import, into google sheets, the last quarter's research and development expense for a few thousand companies from their financial statements. While I want to import several different elements from financial statements, the last quarter R&D expense is currently pertinent (and potentially the previous 3 quarters).
I have tried several different sites (yahoo finance, bloomberg, etc) but the simplest URL seems to be from stockrow.com because I can simply automate the substitution of the stock ticker in the URL.
To get the xpath, I inspect the element and copy the xpath using the browser (have tried with Chrome and Firefox).
I am using IMPORTXML on googlesheets and, on my last attempt, used the following input: =IMPORTXML("https://stockrow.com/JNJ/financials/income/quarterly","/html/body/div[1]/div/div/section/div/div[2]/div[1]/section[4]/div/div[3]/div/div/div[3]/div/div/div[11]/div/span")
I have attempted all sorts of combinations of sites, browsers, and xpaths related to the element, but no matter what I do, I always get the same error "Imported content is empty."
I read xpath google sheet importxml but can't make sense of what is happening in the change to the xpath or how to solve this particular challenge.
Because I want this to be repeatable across multiple stock tickers in google sheets, I am hoping that the "location" of the R&D expense (and other elements in the financial statement) are consistent across all pages, rather than just a specific solution to this challenge.
Looking forward to receiving guidance. Thanks!!
you need some other source. Google Sheets does not support the scraping of JavaScript elements. you can test JS dependency simply by disabling JS for a given site and what's left can be scraped. in your case its nothing:

Google Sheets IMPORTXML XPath help (understanding how to read a page source)

I am trying to write a function that will give me the annual payout dividend for a given stock. The website I am using is www.seekingalpha.com
So I understand that the function is =IMPORTXML (URL, xpath_query).
In that case, my URL is: https://seekingalpha.com/symbol/VOO/dividends/scorecard
but the problem I am having is figuring out the correct XPath to acquire the dividend value.
I currently have this as my function:
=IMPORTXML(CONCATENATE("https://www.seekingalpha.com/symbol/", $B2, "/dividends/scorecard"), "//body")
$B2 is a cell that holds the ticker symbol if you are wondering. Anyways, I right-clicked the number I wanted from the website and followed it downstream and tried seeing where it is nested under but keep running into the wrong "directory" per se, because I am usually left with an error "Empty."
I have also tried copying the xPath directly:
/html/body/div[2]/div[1]/div/div[1]/div/div/div[2]/section/section[1]/table/tbody/tr/td[1]
but am greeted with another empty field error.
Could anyone point me in the right direction? I've been researching this for a while and figured this would be a great way to learn. Thank you in advance
you need some other source. Google Sheets does not support scraping of JavaScript elements. you can test JS dependency simply by disabling JS for given site and what's left can be scraped. in your case its nothing:
UPDATE:
=INDEX(IMPORTXML("https://stocknews.com/stock/"&A15&"/dividends/",
"//div[#class='grade-cat-ytd']"), 2)
understanding how to read a page source

Google Sheets IMPORTXML Text Field from Website

I am trying to dynamically pull in car values for cars matching specific criteria on Kelley Blue Book. I have this IMPORTXML query that has a link to the specific page that shows the trade-in value of the car.
=IMPORTXML("https://www.kbb.com/Api/3.9.462.0/71553/vehicle/upa/PriceAdvisor/meter.svg?action=Get&intent=trade-in-sell&pricetype=FPP&zipcode=12345&vehicleid=411852&selectedoptions=6762567|true|6762674|false|6762900|false|6762905|false|6762909|false|6762913|false|6762915|true|6762926|false|6762928|false&hideMonthlyPayment=False&condition=verygood&mileage=40000", "//text[#y='-8']")
In this URL, there is a text field that has the y coordinate as -8. I was hoping that it would be sufficient to identify the data I want to pull in (The trade-in value). I get the standard Can't fetch URL error and can't figure out why.
the issue is not within your XPath "//text[#y='-8']" but with the website itself.
basically you have two options to test if the website can be scraped:
=IMPORTXML("URL", "//*")
where XPath //* means "everything that's possible to scrape"
and direct source code scrape method:
=IMPORTDATA("URL")
sometimes is source code just huge and Google Sheets can't handle it so this needs to be restricted a bit like:
=ARRAY_CONSTRAIN(IMPORTDATA("URL"), 10000, 10)
anyway, non of these can scrape anything from your URL

Google Sheets xpath query not working

Hoping someone smarter than me can help me sort this out! I've been stumped for a few days now trying to pull some data from website into Google Sheet using ImportXML with no luck.
I'm looking to import the average odds for various sporting events from the website Oddsportal.com which update and change throughout the day. I'd like my sheet to also update these odds, similar to stock prices.
For example:
http://www.oddsportal.com/search/San+Jose+Sharks/
I would like to pull the Average Odds for Team "1" (+136) into a cell, the odds for Tie "X"(+277) into a cell and Team "2"(+161) into individual cells. Just the odds portion. If it's unable to be pulled from that page it is also listed on http://www.oddsportal.com/hockey/usa/nhl/san-jose-sharks-nashville-predators-6cPaAHOM/ down at the bottom in the Average Odds Row.
This seems simple enough but I just can't seem to get the ImportXML query correct without an error.
I've looked at the page's source code (Ctrl-U). The original html does not contain needed values, they most likely loaded later thru xhr (ajax) call:
So most likely you'll not succeed with mere a request html.
You need to explore Network in the browser DevTools to find out what request is initiated (by JS files) to get needed data. This might be even unique one containing hash signiture, so you'll not reproduce it for future use.
I recommend you to turn to scriping tools for retrieving that info.

Resources