Google Sheets IMPORTXML Text Field from Website - xpath

I am trying to dynamically pull in car values for cars matching specific criteria on Kelley Blue Book. I have this IMPORTXML query that has a link to the specific page that shows the trade-in value of the car.
=IMPORTXML("https://www.kbb.com/Api/3.9.462.0/71553/vehicle/upa/PriceAdvisor/meter.svg?action=Get&intent=trade-in-sell&pricetype=FPP&zipcode=12345&vehicleid=411852&selectedoptions=6762567|true|6762674|false|6762900|false|6762905|false|6762909|false|6762913|false|6762915|true|6762926|false|6762928|false&hideMonthlyPayment=False&condition=verygood&mileage=40000", "//text[#y='-8']")
In this URL, there is a text field that has the y coordinate as -8. I was hoping that it would be sufficient to identify the data I want to pull in (The trade-in value). I get the standard Can't fetch URL error and can't figure out why.

the issue is not within your XPath "//text[#y='-8']" but with the website itself.
basically you have two options to test if the website can be scraped:
=IMPORTXML("URL", "//*")
where XPath //* means "everything that's possible to scrape"
and direct source code scrape method:
=IMPORTDATA("URL")
sometimes is source code just huge and Google Sheets can't handle it so this needs to be restricted a bit like:
=ARRAY_CONSTRAIN(IMPORTDATA("URL"), 10000, 10)
anyway, non of these can scrape anything from your URL

Related

Octoparse and relative Xpath iframe extraction issues

I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website. https://beyondthekitchentable.co.uk/podcast/
I'm using Octoparse's free version which allows for scraping locally. The problem is that while Octoparse will automatically auto-detect the Title, Title_URL, and Content webpage data and correctly set up the Pagination, Scroll Page, and Loop item workflow to extract (Title, Title_URL, and Content fields), it does not auto-detect the 'Date' and 'Podcast time duration' fields of each individual podcast as these pieces appear to be getting embedded from an iframe. However, while I am able to custom add Date and Podcast time duration using an Absolute Xpath i.e. //div[#class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]. This results in the same value copied for each record. So when I attempt to fix this by using the Relative XPath setting in Octoparse to loop each item //span[#class="cp-episode-date"] in order to gather all individually unique, it does not get any values even though this relative Xpath //span[#class="cp-episode-date"] is finding all items when I use WebDevTools to search and find all occurrences seen within Chrome. I saw what might be another helpful post on Stackexchange about this but I was not able to make sense of it.
This portion //span[#class="cp-episode-date"] is relative Xpath as it finds multiple Date items in Chrome WebDevTools but it is not complete and I am not sure how to implement the unique Iframe traversal for the Date and Podcast time duration custom added fields I added that Octoparse's Relative XPath settings are looking for. I even tried to install the SelectorsHub Chrome browser extension but it didn't pull up the nested SelectorHub to query the Xpath the way the SelectorHub Youtube video demonstrates - it only showed me the relative Xpath I already am showing below.
Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?
When Absolute Path is used - //div[#class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]
vs.
When Relative Path is used - //span[#class="cp-episode-date"]
There are plenty of iframes inside the webpage. I don't know if Octoparse could handle this. Choose another starting point.
For example, use Apple Podcast :
https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231
Dates could be recovered with the following XPath :
//div[#class="l-row"]//time[#class]/#aria-label
Other possibility, scrape the following page :
https://feeds.captivate.fm/the-website-coach/
Dates could be recovered with the following XPath :
//h4/text()
Even easier, get directly the data from this URL (.json file) :
https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100

Google Sheets ImportXML issues

I have a google sheet that I'm trying to automate as much as possible for my WoW Raid group. What I'm trying to do here is parse some data from WoW's armory to automatically pull a persons item level.
I am having issues pulling from WoW's website directly (https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy), but I can pull the item level from another site (https://raider.io/characters/us/sargeras/beansy). The only difference I can spot is that one site I can pull from a [ /div/span/b clas"text-white" ] and from WoW the information is directly in [ /div/class="media-text" ]
WOW Formula =IMPORTXML(C32,"//*[#id='character-profile-mount']/div/div/div[2]/div/div[1]/div[1]/div/div[2]/div[1]/a[1]/div/div[2]")
Raider IO Formula =IMPORTXML(C31,"//*[#id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b")
WOW Inspect Element <div class="Media-text">184 ilvl</div>
Raider IO Inspect Element <b class="text-white">184</b>
Above are the respective formula's and elements I've used. Raider IO's pulls properly and outputs 184 as it's information. However WoW's does not pull properly and outputs N/A Google Sheets Output Screencap
Does anyone have any ideas on why this might be happening?
Thanks in advance!
I think that the https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy prepares the values using Javascript. For example, when the HTML without using Javascript is retrieved from this URL, Media-text cannot be found in the retrieved HTML. On the other hand, https://raider.io/characters/us/sargeras/beansy has the values in the HTML without using Javascript. I thought that the difference is due to this.
But in order to retrieve the value of 184 from URL of the former, when I saw the HTML without using Javascript, I noticed that the value is included in the metadata. So when the value of 184 is retrieved from the metadata, the sample formula is as follows.
Sample formula:
=REGEXEXTRACT(IMPORTXML(A1,"//meta[#name='description']/#content"),"(\d+) ilvl")
In this formula, the URL of https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy is put to the cell "A1".
Result:
Also, as additional modification, about your =IMPORTXML(URL,"//*[#id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b"), in this case, the xpath might be able to be modified simple a little as follows.
Modified formula:
=IMPORTXML(A1,"//span[contains(text(),'Item Level')]/b[#class='text-white']")
In this formula, the URL of https://raider.io/characters/us/sargeras/beansy is put to the cell "A1".
Result:
References:
IMPORTXML
REGEXEXTRACT

Incorrect URL/XPath when using Google sheets IMPORTXML

I'm trying to import a search result from google to my spreadsheet. I've had success with Wikipedia pages, but for some reason, Google search isn't working correctly (giving a "could not fetch url" error). I'm sure the problem is somewhere in my URL or XPath, but I've been trying a variety of things and I'm lost. Here is what I've got:
=IMPORTXML("https://www.google.com/search?q=dom+fera+easy+thing+released", "//div[#class='Z0LcW XcVN5d']")
I'm linking the spreadsheet below as view-only for reference as well. Ultimately the goal is to be able to webscrape release years of songs. I'd appreciate any help!
https://docs.google.com/spreadsheets/d/1bt8MJ23nfGAv6ianaR-sd7DM5DNn98p7zWSG1UzBlEY/edit?usp=sharing
AFAIK, you can't parse results from GoogleSearch in Google Sheets.
Using Discogs, MusicBrainz, All Music... to get the release dates could be useful.
But it seems some of your groups are little known. So, you can use Youtube to fetch the dates.
Note : we assume the year of publication on Youtube corresponds to the year of release.
Of course, that's not 100% true. For example, artists can clip their video months after release. Or publish nothing on Youtube.
So this method will work with a wide range of songs but not ALL the songs. With recent bands and songs, it should be OK.
To do this you can use the Youtube API or IMPORTXML formulas. In both cases, we always take the first result (relevant order) of the search engine as source.
You need an API key and an ImportJSON script (credits to Brad Jasper) to use the API method. Once you have installed the script and activated your API key,you can paste in cell B3:
="https://www.googleapis.com/youtube/v3/search?key={yourAPIKey}&part=snippet&type=video&filter=items&regionCode=FR&q="&ENCODEURL(A3)
We generate the url to query with the content you input in column A.
We use "regionCode=FR" since some songs are not available in the US ("i need you FMLYBND"). That way we get the correct release date.
In C3, you can paste :
=LEFT(QUERY(ImportJSON(B3);"SELECT Col11 LIMIT 1 label Col11''";1);4)
We parse the JSON, select the column of interest, the line of interest, then we clean the result.
With the IMPORTXML method, you can paste in E3 :
="https://www.youtube.com"&IMPORTXML("https://www.youtube.com/results?search_query="&A3;"(//div[#class='yt-lockup-thumbnail contains-addto'])[3]/a/#href")
We construct the url with the first search result of the search engine.
In F3, you can paste :
=LEFT(IMPORTXML(E3;"//meta[#itemprop='datePublished']/#content");4)
We parse the previously built url, then we extract the year of publication.
As you can see, there's a difference in the results on line 5. That's because the song is not available in the US. The first result returned in the IMPORTXML method is different from the one of the API method which uses a "FR" flag.
Side note : I'm based in Europe. So ";" in the formulas should be replaced with ",".
google does not support web scraping of google search into google sheets. this option was disabled 2 years ago. you will need to use alternative search engine

Morningstar xpath return empty in Google Sheet (Imported content is empty) [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I am trying to pull a number from the Morningstar "Cash Flow" page an arbitrary stock ticker using XPath. I have the tested the XPath on the morningstar website by an XPath tester and it returned desired values. However, when I want to use this value in a google sheet, it returns #N/A (Imported content is empty.).
=IMPORTXML("http://financials.morningstar.com/cash-flow/cf.html?t=fb&region=usa&culture=en-US", "//div[#id='data_tts1']/div")
I did a bit of research on this and find out that data in such websites generated dynamically and downloads the content in stages, Therefore, page needs to be loaded first to be able to pull any data out of it!
I'm wondering if there is any solution to this issue?
You help would much be appreciated.
it's empty as it should be because the content you are trying to scrape is of JavaScript origin. Google Sheets does not support imports of JS elements. you can always test this by disabling JS for a given site and only what's left can be scraped:
It might be possible. But you have to prepare a custom sheet to extract the data. Use IMPORTDATA to parse the .json which contains the data :
http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=672024&callback=jsonp1585016592836&_=1585016593002
AFAIK, you couldn't import directly the .csv version (specific headers needed, so curl or other specific tools would be required).
http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=764423&denominatorView=raw&number=3
Since this .json is very special (contains html tags), i don't think a custom script for GoogleSheets could import it correctly. So once the .json is loaded in GoogleSheets, TRANSPOSE the rows to columns and use formulas to locate your data (target the cells which contain data_s1 and data_s2 for example). Use CONCAT to merge the cells of interest. Then split the result into columns (use a custom separator). SEARCH for the data you want and clean the results with SUBSTITUTE. The method is dirty but i think it could be automated for the whole process.

Using XPath to get strings between and inside tags

Super new to XPath so forgive me if I stumble through terms. I'm using IMPORTXML() in a google doc in order to pull info from a webpage. Basically what I'm shooting for is to turn this
into
What I can't figure out is how to pull info between the <br> nodes and pull the string from within the <a> node.
I've fumbled my way as far as =IMPORTXML($A$1, "//p/b[starts-with(text(), '"& $A4 &"')]/following-sibling::text()[1]") to get a return of 1 for Casting Time, but not any further.
The end goal is to do this for about a dozen different values across the page and cycle the checks through about 500 web pages, hence the cells in the formula. Any help would be appreciated.
Super in depth clarification section
Using XPath and a Google Sheet I am attempting to automatically make a roll20 formatted template macro for each spell on a spell casters list.
For example, the Shaman Spell List I used //tr/td[1]/a[#href] and //tr/td[1]/a/#href to create side by side columns of spell names and their associated URL's.
Then on another page I can copy and paste the entire class spell list and use Vlookup to get the associated URL's while keeping the organized level sectioned tables like so (Note the Hyperlinked spell names are rich text so the internal URL is invisible to IMPORTXML, hence the extra step).
With a single class having upwards of 500+ spells the ultimate goal is to create a series of IMPORTXML that look at the spell URL and pull relevant data from this particular section. For this example I'm using Arcane Mark.
The final goal is to use IMPORTXML to get each important category such as School, Casting Time, Target, Effect, Area, Range, etc. Put them in their respective columns and have a Concatenate I've written go through and pull all the various parts into one big formatted string compatible with the roll20 macro template to look like &{template:default} {{Name=Arcane mark}} {{School=Universal}} {{Casting Time=1 Standard Action}} {{Components=V,S}} {{Range=Touch}} {{Effect=One personal rune or mark, all of which must fit within 1 sq. ft.}} {{Duration=Permanent}} {{Saving Throw=None}} {{Spell Resistance=No}}
=ARRAYFORMULA(REGEXEXTRACT(TRANSPOSE(QUERY(TRANSPOSE(QUERY(ARRAY_CONSTRAIN(
IMPORTDATA("http://www.d20pfsrd.com/magic/all-spells/a/arcane-mark"),1000,5),
"where Col1 contains 'School'", 0)),,999^99)), A10&"\</b>\ (.+)\;"))

Resources