Octoparse and relative XPath iframe extraction issues

I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website. https://beyondthekitchentable.co.uk/podcast/
I'm using Octoparse's free version, which allows for scraping locally. Octoparse auto-detects the Title, Title_URL, and Content fields and correctly sets up the Pagination, Scroll Page, and Loop Item workflow to extract them, but it does not auto-detect the 'Date' and 'Podcast time duration' fields of each individual podcast, as these appear to be embedded from an iframe. I can add Date and Podcast time duration as custom fields using an absolute XPath, i.e. //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1], but this copies the same value into every record. When I instead try to fix this with Octoparse's Relative XPath setting, looping each item with //span[@class="cp-episode-date"] to gather the individually unique values, it returns nothing, even though that same relative XPath finds all occurrences when I search with Chrome's DevTools. I saw what might be another helpful post on Stack Exchange about this, but I was not able to make sense of it.
The portion //span[@class="cp-episode-date"] is a relative XPath, since it finds multiple Date items in Chrome DevTools, but it is not complete, and I am not sure how to implement the iframe traversal that my custom-added Date and Podcast time duration fields need for Octoparse's Relative XPath setting. I even tried the SelectorsHub Chrome browser extension, but it didn't pull up the nested-iframe selector query the way the SelectorsHub YouTube video demonstrates; it only showed me the relative XPath I already have below.
Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?
When an absolute XPath is used: //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]
vs.
When a relative XPath is used: //span[@class="cp-episode-date"]

There are plenty of iframes inside that webpage, and I don't know whether Octoparse can handle them. Choose another starting point instead.
For example, use Apple Podcasts:
https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231
Dates can be recovered with the following XPath:
//div[@class="l-row"]//time[@class]/@aria-label
Another possibility: scrape the following page:
https://feeds.captivate.fm/the-website-coach/
Dates can be recovered with the following XPath:
//h4/text()
Even easier, get the data directly from this URL (a .json file):
https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100
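If you want to sanity-check those XPaths outside Octoparse, the same extractions can be sketched as Google Sheets formulas (a sketch, assuming the pages still serve this markup):
=IMPORTXML("https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231", "//div[@class='l-row']//time[@class]/@aria-label")
=IMPORTXML("https://feeds.captivate.fm/the-website-coach/", "//h4/text()")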

Related

Google Sheets ImportXML issues

I have a Google Sheet that I'm trying to automate as much as possible for my WoW raid group. What I'm trying to do here is parse some data from WoW's armory to automatically pull a person's item level.
I am having issues pulling from WoW's website directly (https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy), but I can pull the item level from another site (https://raider.io/characters/us/sargeras/beansy). The only difference I can spot is that on one site I pull from a /div/span/b[@class="text-white"], while on WoW's site the information sits directly in a /div[@class="media-text"].
WOW Formula =IMPORTXML(C32,"//*[@id='character-profile-mount']/div/div/div[2]/div/div[1]/div[1]/div/div[2]/div[1]/a[1]/div/div[2]")
Raider IO Formula =IMPORTXML(C31,"//*[@id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b")
WOW Inspect Element <div class="Media-text">184 ilvl</div>
Raider IO Inspect Element <b class="text-white">184</b>
Above are the respective formulas and elements I've used. Raider IO's pulls properly and outputs 184. However, WoW's does not pull properly and outputs #N/A.
Does anyone have any ideas on why this might be happening?
Thanks in advance!
I think that https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy prepares its values using JavaScript. For example, when the HTML is retrieved from this URL without JavaScript running, Media-text cannot be found in it. On the other hand, https://raider.io/characters/us/sargeras/beansy has the values in the HTML even without JavaScript. I think the difference is due to this.
But in order to retrieve the value of 184 from the former URL, I looked at the JavaScript-free HTML and noticed that the value is included in the page's metadata. So the value of 184 can be retrieved from the metadata with the following sample formula.
Sample formula:
=REGEXEXTRACT(IMPORTXML(A1,"//meta[@name='description']/@content"),"(\d+) ilvl")
In this formula, the URL https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy is put in cell "A1".
Also, as an additional modification: your =IMPORTXML(URL,"//*[@id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b") XPath can be simplified a little, as follows.
Modified formula:
=IMPORTXML(A1,"//span[contains(text(),'Item Level')]/b[@class='text-white']")
In this formula, the URL https://raider.io/characters/us/sargeras/beansy is put in cell "A1".
References:
IMPORTXML
REGEXEXTRACT

Tool for extracting an XPath query from a specified/selected node

Normally, one would use an XPath query to obtain a certain value or node. In my case, I'm doing some web scraping with Google Sheets, using the IMPORTXML function to automatically update some values. Two examples are given below:
=importxml("http://www.creditagricoledtvm.com.br/";"(//td[#class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
The problem is that the pages I'm scraping will change every now and then and I understand very little about XML/XPath, so it takes a lot of trial and error to get to a node. I was wondering if there is any tool I could use to point to an element (either in the page or in its code) that would provide an appropriate query.
For example, in the second case, I noticed the info I wanted was in a span node (hence (//span)), so I printed all of them in a spreadsheet and used the line count to find the [32] index. This takes a long time to load, so it's pretty inconvenient. Also, I don't even remember how I figured out the //td[@class='xl7825385'] query. That's why I'm wondering if there is a more practical method of pointing at page elements.
Some clues:
Learning XPath basics is still useful. W3Schools is a good starting point.
https://www.w3schools.com/xml/xpath_intro.asp
Otherwise, the built-in dev tools of your browser can help you generate an absolute XPath. Select an element, right-click on it, then Copy > Copy XPath.
https://developers.google.com/web/tools/chrome-devtools/open
Browser extensions like ChroPath can generate absolute or relative XPaths for you.
https://autonomiq.io/chropath/
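As a side note on the span-counting workaround described in the question: MATCH can locate the index for you instead of counting rows by hand (a sketch; "24.20" is a hypothetical placeholder for whatever value you can read on the page):
=MATCH("24.20", IMPORTXML("http://www.bloomberg.com/quote/ELIPCAM:BZ", "//span"), 0)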

Morningstar xpath return empty in Google Sheet (Imported content is empty) [duplicate]

This question already has answers here: Scraping data to Google Sheets from a website that uses JavaScript (2 answers). Closed last month.
I am trying to pull a number from the Morningstar "Cash Flow" page for an arbitrary stock ticker using XPath. I have tested the XPath on the Morningstar website with an XPath tester and it returned the desired values. However, when I use this in a Google Sheet, it returns #N/A (Imported content is empty.).
=IMPORTXML("http://financials.morningstar.com/cash-flow/cf.html?t=fb&region=usa&culture=en-US", "//div[#id='data_tts1']/div")
I did a bit of research on this and found out that data on such websites is generated dynamically and the content downloads in stages, so the page needs to load first before any data can be pulled out of it.
I'm wondering if there is any solution to this issue? Your help would be much appreciated.
It's empty, as it should be, because the content you are trying to scrape is of JavaScript origin. Google Sheets does not support imports of JS-rendered elements. You can always test this by disabling JS for a given site; only what's left can be scraped.
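You can run the same kind of probe from Sheets itself with the catch-all XPath //* (a sketch; expect little or nothing back for this page, which is the point):
=IMPORTXML("http://financials.morningstar.com/cash-flow/cf.html?t=fb&region=usa&culture=en-US", "//*")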
It might be possible, but you have to prepare a custom sheet to extract the data. Use IMPORTDATA to parse the .json which contains the data:
http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=672024&callback=jsonp1585016592836&_=1585016593002
AFAIK, you can't import the .csv version directly (specific headers are needed, so curl or other specific tools would be required):
http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=764423&denominatorView=raw&number=3
Since this .json is very special (it contains HTML tags), I don't think a custom script for Google Sheets could import it correctly. So once the .json is loaded into Google Sheets, TRANSPOSE the rows to columns and use formulas to locate your data (target the cells which contain data_s1 and data_s2, for example). Use CONCAT to merge the cells of interest, then split the result into columns (use a custom separator), SEARCH for the data you want, and clean the results with SUBSTITUTE. The method is dirty, but I think the whole process could be automated.
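A minimal sketch of the first steps, assuming the long AJAX URL above is stored in cell A1 and the transposed import is placed at B1:
=TRANSPOSE(IMPORTDATA(A1))
=FILTER(B1:1, ISNUMBER(SEARCH("data_s1", B1:1)))
The first formula pulls the raw response and flips it; the second keeps only the cells in that row which mention the data_s1 marker, ready for the CONCAT/split/SUBSTITUTE cleanup described above.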

Google Sheets IMPORTXML Text Field from Website

I am trying to dynamically pull in car values for cars matching specific criteria on Kelley Blue Book. I have this IMPORTXML query that has a link to the specific page that shows the trade-in value of the car.
=IMPORTXML("https://www.kbb.com/Api/3.9.462.0/71553/vehicle/upa/PriceAdvisor/meter.svg?action=Get&intent=trade-in-sell&pricetype=FPP&zipcode=12345&vehicleid=411852&selectedoptions=6762567|true|6762674|false|6762900|false|6762905|false|6762909|false|6762913|false|6762915|true|6762926|false|6762928|false&hideMonthlyPayment=False&condition=verygood&mileage=40000", "//text[#y='-8']")
In this URL, there is a text element whose y coordinate is -8. I was hoping that would be sufficient to identify the data I want to pull in (the trade-in value). I get the standard "Can't fetch URL" error and can't figure out why.
The issue is not with your XPath "//text[@y='-8']" but with the website itself.
Basically, you have two options to test whether a website can be scraped:
=IMPORTXML("URL", "//*")
where the XPath //* means "everything that's possible to scrape"
and the direct source-code scrape method:
=IMPORTDATA("URL")
sometimes the source code is just huge and Google Sheets can't handle it, so the import needs to be restricted a bit, like:
=ARRAY_CONSTRAIN(IMPORTDATA("URL"), 10000, 10)
Anyway, none of these can scrape anything from your URL.

Using XPath to get strings between and inside tags

Super new to XPath, so forgive me if I stumble through terms. I'm using IMPORTXML() in a Google Sheet in order to pull info from a webpage. Basically, I'm shooting to turn the raw spell stat block into a structured table (before/after screenshots omitted).
What I can't figure out is how to pull the info between the <br> nodes and pull the string from within the <a> node.
I've fumbled my way as far as =IMPORTXML($A$1, "//p/b[starts-with(text(), '"& $A4 &"')]/following-sibling::text()[1]"), which gets a return of 1 for Casting Time, but not any further.
The end goal is to do this for about a dozen different values across the page and cycle the checks through about 500 web pages, hence the cells in the formula. Any help would be appreciated.
Super in-depth clarification section
Using XPath and a Google Sheet, I am attempting to automatically build a roll20-formatted template macro for each spell on a spellcaster's list.
For example, on the Shaman Spell List I used //tr/td[1]/a[@href] and //tr/td[1]/a/@href to create side-by-side columns of spell names and their associated URLs.
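Spelled out as full formulas (a sketch, assuming the Shaman spell-list URL sits in cell A1):
=IMPORTXML(A1, "//tr/td[1]/a[@href]")
=IMPORTXML(A1, "//tr/td[1]/a/@href")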
Then on another page I can copy and paste the entire class spell list and use VLOOKUP to get the associated URLs while keeping the organized, level-sectioned tables (note: the hyperlinked spell names are rich text, so the internal URL is invisible to IMPORTXML, hence the extra step).
With a single class having upwards of 500 spells, the ultimate goal is to create a series of IMPORTXML formulas that look at each spell URL and pull the relevant data from a particular section of the page. For this example I'm using Arcane Mark.
The final goal is to use IMPORTXML to get each important category, such as School, Casting Time, Target, Effect, Area, and Range, put them in their respective columns, and have a CONCATENATE formula I've written pull all the various parts into one big formatted string compatible with the roll20 macro template, looking like &{template:default} {{Name=Arcane mark}} {{School=Universal}} {{Casting Time=1 Standard Action}} {{Components=V,S}} {{Range=Touch}} {{Effect=One personal rune or mark, all of which must fit within 1 sq. ft.}} {{Duration=Permanent}} {{Saving Throw=None}} {{Spell Resistance=No}}
=ARRAYFORMULA(REGEXEXTRACT(TRANSPOSE(QUERY(TRANSPOSE(QUERY(ARRAY_CONSTRAIN(
IMPORTDATA("http://www.d20pfsrd.com/magic/all-spells/a/arcane-mark"),1000,5),
"where Col1 contains 'School'", 0)),,999^99)), A10&"\</b>\ (.+)\;"))
