Google Sheets ImportXML issues - google-sheets-formula

I have a Google Sheet that I'm trying to automate as much as possible for my WoW raid group. What I'm trying to do here is parse some data from WoW's armory to automatically pull a person's item level.
I am having issues pulling from WoW's website directly (https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy), but I can pull the item level from another site (https://raider.io/characters/us/sargeras/beansy). The only difference I can spot is that on one site I can pull from a [ /div/span/b class="text-white" ], while on WoW the information is directly in [ /div/class="media-text" ].
WOW Formula =IMPORTXML(C32,"//*[@id='character-profile-mount']/div/div/div[2]/div/div[1]/div[1]/div/div[2]/div[1]/a[1]/div/div[2]")
Raider IO Formula =IMPORTXML(C31,"//*[@id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b")
WOW Inspect Element <div class="Media-text">184 ilvl</div>
Raider IO Inspect Element <b class="text-white">184</b>
Above are the respective formulas and elements I've used. Raider IO's pulls properly and outputs 184 as its value. However, WoW's does not pull properly and outputs #N/A (see the Google Sheets output screencap).
Does anyone have any ideas on why this might be happening?
Thanks in advance!

I think that https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy prepares the values using JavaScript. For example, when the HTML is retrieved from this URL without JavaScript, Media-text cannot be found in the retrieved HTML. On the other hand, https://raider.io/characters/us/sargeras/beansy already has the values in the HTML without JavaScript. I think the difference is due to this.
But in order to retrieve the value of 184 from the former URL, when I looked at the HTML retrieved without JavaScript, I noticed that the value is included in the metadata. So the value of 184 can be retrieved from the metadata with the following sample formula.
Sample formula:
=REGEXEXTRACT(IMPORTXML(A1,"//meta[@name='description']/@content"),"(\d+) ilvl")
In this formula, the URL https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy is put in cell "A1".
Result:
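Outside of Sheets, the same check can be sketched in Python with requests and lxml. This is only an illustration; it assumes the Blizzard page still serves the item level in its meta description and answers a plain HTTP client:

import re
import requests
from lxml import html

# Fetch the raw HTML, i.e. without running any JavaScript, which is all IMPORTXML ever sees.
url = "https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy"
raw = requests.get(url, timeout=30).text

# The rendered "Media-text" div is produced by JavaScript, so it is absent from the raw HTML.
print('"Media-text" present in raw HTML:', "Media-text" in raw)

# The meta description, however, is served with the page and contains e.g. "... 184 ilvl ...".
description = html.fromstring(raw).xpath("string(//meta[@name='description']/@content)")
match = re.search(r"(\d+) ilvl", description)
print("Item level from metadata:", match.group(1) if match else "not found")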
Also, as an additional note about your =IMPORTXML(URL,"//*[@id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b"), the xpath might be simplified a little as follows.
Modified formula:
=IMPORTXML(A1,"//span[contains(text(),'Item Level')]/b[@class='text-white']")
In this formula, the URL https://raider.io/characters/us/sargeras/beansy is put in cell "A1".
Result:
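If you want to verify the simplified XPath outside Sheets, a minimal Python sketch (assuming raider.io still serves this markup to plain HTTP clients) would be:

import requests
from lxml import html

# Fetch the raw raider.io HTML (no JavaScript), which is what IMPORTXML parses.
url = "https://raider.io/characters/us/sargeras/beansy"
tree = html.fromstring(requests.get(url, timeout=30).text)

# Same idea as the simplified formula: anchor on the "Item Level" label instead of a long positional path.
print(tree.xpath("//span[contains(text(),'Item Level')]/b[@class='text-white']/text()"))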
References:
IMPORTXML
REGEXEXTRACT

Related

Octoparse and relative Xpath iframe extraction issues

I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website. https://beyondthekitchentable.co.uk/podcast/
I'm using Octoparse's free version, which allows scraping locally. The problem is that while Octoparse will automatically auto-detect the Title, Title_URL, and Content webpage data and correctly set up the Pagination, Scroll Page, and Loop Item workflow to extract those fields, it does not auto-detect the 'Date' and 'Podcast time duration' fields of each individual podcast, as these pieces appear to be embedded from an iframe. I am able to custom-add Date and Podcast time duration using an absolute XPath, i.e. //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1], but this results in the same value being copied for every record. When I attempt to fix this by using Octoparse's Relative XPath setting to loop each item with //span[@class="cp-episode-date"] in order to gather each unique value, it does not get any values, even though this relative XPath finds all occurrences when I search with Chrome's WebDevTools. I saw what might be another helpful post on Stack Exchange about this, but I was not able to make sense of it.
This portion, //span[@class="cp-episode-date"], is a relative XPath, as it finds multiple Date items in Chrome WebDevTools, but it is not complete, and I am not sure how to implement the iframe traversal for the custom-added Date and Podcast time duration fields that Octoparse's Relative XPath setting is expecting. I even tried installing the SelectorsHub Chrome browser extension, but it didn't pull up the nested SelectorsHub to query the XPath the way the SelectorsHub YouTube video demonstrates; it only showed me the relative XPath I already have below.
Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?
When Absolute Path is used - //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]
vs.
When Relative Path is used - //span[@class="cp-episode-date"]
There are plenty of iframes inside the webpage. I don't know if Octoparse could handle this. Choose another starting point.
For example, use Apple Podcasts:
https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231
Dates could be recovered with the following XPath:
//div[@class="l-row"]//time[@class]/@aria-label
Another possibility: scrape the following page:
https://feeds.captivate.fm/the-website-coach/
Dates could be recovered with the following XPath:
//h4/text()
Even easier, get the data directly from this URL (a .json file):
https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100
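To illustrate that last option, here is a short Python sketch that reads the dates and durations straight from that .json. The field names used (wrapperType, releaseDate, trackName, trackTimeMillis) are my reading of the iTunes lookup API and should be checked against the live response:

import requests

# The iTunes lookup endpoint above returns plain JSON, so no HTML scraping is needed.
url = ("https://itunes.apple.com/lookup"
       "?id=1587503231&media=podcast&entity=podcastEpisode&limit=100")
data = requests.get(url, timeout=30).json()

# The first result describes the podcast itself; episode records carry the date and duration.
for item in data.get("results", []):
    if item.get("wrapperType") == "podcastEpisode":
        minutes = round(item.get("trackTimeMillis", 0) / 60000)
        print(item.get("releaseDate"), "-", item.get("trackName"), f"({minutes} min)")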

Morningstar xpath return empty in Google Sheet (Imported content is empty) [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
I am trying to pull a number from the Morningstar "Cash Flow" page for an arbitrary stock ticker using XPath. I have tested the XPath against the Morningstar website with an XPath tester and it returned the desired values. However, when I use it in a Google Sheet, it returns #N/A (Imported content is empty.).
=IMPORTXML("http://financials.morningstar.com/cash-flow/cf.html?t=fb&region=usa&culture=en-US", "//div[#id='data_tts1']/div")
I did a bit of research on this and found out that data on such websites is generated dynamically and the content is downloaded in stages; therefore, the page needs to be loaded first before any data can be pulled out of it.
I'm wondering if there is any solution to this issue?
Your help would be much appreciated.
It's empty, as it should be, because the content you are trying to scrape is of JavaScript origin. Google Sheets does not support importing JS-generated elements. You can always test this by disabling JS for a given site; only what's left can be scraped:
It might be possible, but you have to prepare a custom sheet to extract the data. Use IMPORTDATA to parse the .json which contains the data:
http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=672024&callback=jsonp1585016592836&_=1585016593002
AFAIK, you can't import the .csv version directly (specific headers are needed, so curl or other specific tools would be required):
http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=764423&denominatorView=raw&number=3
Since this .json is very special (it contains HTML tags), I don't think a custom script for Google Sheets could import it correctly. So once the .json is loaded in Google Sheets, TRANSPOSE the rows to columns and use formulas to locate your data (target the cells which contain data_s1 and data_s2, for example). Use CONCAT to merge the cells of interest, then split the result into columns (use a custom separator). SEARCH for the data you want and clean the results with SUBSTITUTE. The method is dirty, but I think the whole process could be automated.
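If you prefer to do the cleanup outside Sheets, the same .json can be fetched and unwrapped with a few lines of Python. This is a rough sketch; the endpoint and its query parameters are taken verbatim from the URL above and may no longer work:

import json
import re
import requests

# The ajax endpoint above returns JSONP, e.g. jsonp1585016592836({...}); the r= and
# callback= parameters may have changed or the endpoint may since have been retired.
url = ("http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html"
       "?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A"
       "&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw"
       "&r=672024&callback=jsonp1585016592836&_=1585016593002")
body = requests.get(url, timeout=30).text

# Strip the jsonp wrapper to get at the JSON payload (which itself embeds HTML fragments).
match = re.search(r"^\s*[\w.]+\((.*)\)\s*;?\s*$", body, re.S)
if match:
    payload = json.loads(match.group(1))
    print(list(payload.keys()))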

Google Sheets IMPORTXML Text Field from Website

I am trying to dynamically pull in car values for cars matching specific criteria on Kelley Blue Book. I have this IMPORTXML query that has a link to the specific page that shows the trade-in value of the car.
=IMPORTXML("https://www.kbb.com/Api/3.9.462.0/71553/vehicle/upa/PriceAdvisor/meter.svg?action=Get&intent=trade-in-sell&pricetype=FPP&zipcode=12345&vehicleid=411852&selectedoptions=6762567|true|6762674|false|6762900|false|6762905|false|6762909|false|6762913|false|6762915|true|6762926|false|6762928|false&hideMonthlyPayment=False&condition=verygood&mileage=40000", "//text[#y='-8']")
In this URL, there is a text field that has the y coordinate -8. I was hoping that would be sufficient to identify the data I want to pull in (the trade-in value). I get the standard "Can't fetch URL" error and can't figure out why.
The issue is not within your XPath "//text[@y='-8']" but with the website itself.
Basically, you have two options to test whether the website can be scraped:
=IMPORTXML("URL", "//*")
where XPath //* means "everything that's possible to scrape"
and direct source code scrape method:
=IMPORTDATA("URL")
sometimes the source code is just huge and Google Sheets can't handle it, so this needs to be restricted a bit, like:
=ARRAY_CONSTRAIN(IMPORTDATA("URL"), 10000, 10)
Anyway, none of these can scrape anything from your URL.
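As a side note, you can reproduce the "Can't fetch URL" diagnosis outside Sheets with a plain HTTP client. This is only a sketch; whether kbb.com answers such a request at all is an assumption:

import requests

# A quick way to tell "the server refuses non-browser clients" apart from "the content is
# JS-generated": request the URL with a plain client and look at the status code.
# The query string is omitted here for readability; paste the full meter.svg URL from the formula above.
url = "https://www.kbb.com/Api/3.9.462.0/71553/vehicle/upa/PriceAdvisor/meter.svg"
resp = requests.get(url, timeout=30)
print(resp.status_code)                      # 403/401 points to blocking, 200 to a parsing issue
print(resp.headers.get("Content-Type"))      # the meter itself is an SVG document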

Google sheets importxml weird import - Can't get the correct path to elements

I'm trying to get some data from this website https://etfdb.com/etf/VOO/ with IMPORTXML. Unfortunately, I was not able to scrape a particular element of the page; I only got data from these two functions:
=IMPORTXML("https://etfdb.com/etf/VOO","//*")
=IMPORTXML("https://etfdb.com/etf/VOO","/html")
I tried to see if the browser was only loading data through JS, but after disabling it the site loaded correctly, so I don't think JS is the problem here.
How come after running a simple function like this, I get an error saying the scraped content is empty?
//span[contains(text(),'Tracks This Index:')]/following-sibling::span
EDIT: added spreadsheet with desired output https://docs.google.com/spreadsheets/d/1Zn0fQwenYZo6u4jP0yZ7J-NCzyzRnqabR3CDUz8jP3E/edit?usp=sharing
How about this answer?
Issue:
Unfortunately, the value cannot be retrieved with the XPath //span[contains(text(),'Tracks This Index:')]/following-sibling::span from the HTML of this URL. For example, even when //span is used, #N/A is returned. The reason for this issue is mentioned in Rubén's answer.
Workaround:
Here, I would like to propose a workaround. Please think of this as just one of several possible answers. In this workaround, the value you want is retrieved from the full text of the body. Although individual tags in the body cannot be retrieved, //body can be, and fortunately the value you want is included in the value retrieved from //body. The flow of this workaround is as follows:
Retrieve the values using the XPath //body.
Retrieve the value you want with a regular expression.
Sample formula:
=TEXTJOIN("",TRUE,IFNA(ARRAYFORMULA(TRIM(REGEXEXTRACT(IMPORTXML(A1,"//body"),"Tracks This Index: (\w.+)"))),""))
In this sample, cell "A1" has the URL https://etfdb.com/etf/VOO.
After the value of //body is retrieved, the target value is extracted with the regular expression.
The important point of this workaround is the methodology. I think that there are various formulas for retrieving the value, so please think of the above sample formula as just one of them.
Result:
Note:
If you use the above formula for another URL, an error might occur. Please be careful of this.
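For reference, the same //body-plus-regular-expression idea can be reproduced outside Sheets. A minimal Python sketch, where the browser-like User-Agent header is an assumption about what etfdb.com requires of plain HTTP clients:

import re
import requests
from lxml import html

# Rough Python analogue of the //body + REGEXEXTRACT workaround above.
url = "https://etfdb.com/etf/VOO/"
raw = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text

# Flatten the whole body to text, then pull the value out with a regular expression,
# just as the TEXTJOIN/REGEXEXTRACT formula does.
body_text = html.fromstring(raw).xpath("string(//body)")
match = re.search(r"Tracks This Index:\s*(\w.+)", body_text)
print(match.group(1).strip() if match else "not found")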
References:
IMPORTXML
REGEXEXTRACT
ARRAYFORMULA
IFNA
TEXTJOIN
If this was not the direction you want, I apologize.
This is a partial answer.
The problem occurs because https://etfdb.com/etf/VOO/ isn't a valid XHTML file.
Some failures:
Use of <hr> instead of <hr/>
Use of <br> instead of <br/>
These failures mean that IMPORTXML can't parse the sibling tags that follow them.
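A minimal Python sketch of the parsing difference, using lxml to contrast a strict XML parser with a forgiving HTML parser (an illustration of the point above, not the page's actual markup):

from lxml import etree, html

# A strict XML parser rejects HTML-style void tags such as <hr> and <br>,
# which is essentially what trips IMPORTXML up on this page.
snippet = "<div><p>first</p><hr><p>second</p></div>"

try:
    etree.fromstring(snippet)                  # XML parsing fails on the unclosed <hr>
except etree.XMLSyntaxError as err:
    print("XML parse error:", err)

# A forgiving HTML parser repairs the markup, so the sibling elements stay reachable.
print(html.fromstring(snippet).xpath("//hr/following-sibling::p/text()"))   # ['second']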

Google Sheets xpath query not working

Hoping someone smarter than me can help me sort this out! I've been stumped for a few days now trying to pull some data from a website into a Google Sheet using IMPORTXML, with no luck.
I'm looking to import the average odds for various sporting events from the website Oddsportal.com which update and change throughout the day. I'd like my sheet to also update these odds, similar to stock prices.
For example:
http://www.oddsportal.com/search/San+Jose+Sharks/
I would like to pull the average odds for Team "1" (+136), Tie "X" (+277), and Team "2" (+161) into individual cells, just the odds portion. If they can't be pulled from that page, they are also listed on http://www.oddsportal.com/hockey/usa/nhl/san-jose-sharks-nashville-predators-6cPaAHOM/ down at the bottom in the Average Odds row.
This seems simple enough but I just can't seem to get the ImportXML query correct without an error.
I've looked at the page's source code (Ctrl-U). The original HTML does not contain the needed values; they are most likely loaded later through an XHR (AJAX) call:
So most likely you will not succeed with merely requesting the HTML.
You need to explore the Network tab in the browser DevTools to find out which request (initiated by JS files) fetches the needed data. It might even be a one-off request containing a hash signature, so you may not be able to reproduce it for future use.
I recommend turning to scraping tools for retrieving that info.
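For anyone who wants to double-check this outside the browser, a small Python sketch that captures exactly what a formula-based fetch would see (whether oddsportal.com answers a plain HTTP client at all is an assumption):

import requests

# Fetch what IMPORTXML actually receives: the initial HTML, before any JavaScript/XHR runs.
url = "http://www.oddsportal.com/search/San+Jose+Sharks/"
raw = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text

# Save it and search it by hand (Ctrl-F); the odds figures are not in this document,
# which is why no IMPORTXML XPath can reach them.
with open("oddsportal_initial.html", "w", encoding="utf-8") as fh:
    fh.write(raw)
print(len(raw), "characters of initial HTML saved to oddsportal_initial.html")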
