Using XPath to get strings between and inside tags

Super new to XPath, so forgive me if I stumble through terms. I'm using IMPORTXML() in a Google Sheet to pull info from a webpage. Basically, what I'm shooting for is to turn a spell's stat block (bold labels such as Casting Time, each followed by its value and separated by <br> tags) into one spreadsheet cell per value.
What I can't figure out is how to pull the info between the <br> nodes and how to pull the string from within the <a> node.
I've fumbled my way as far as =IMPORTXML($A$1, "//p/b[starts-with(text(), '"& $A4 &"')]/following-sibling::text()[1]") to get a return of 1 for Casting Time, but not any further.
The end goal is to do this for about a dozen different values across the page and cycle the checks through about 500 web pages, hence the cells in the formula. Any help would be appreciated.
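For the <a> part specifically, a hedged sketch building on the same pattern (page URL in $A$1, label text in $A4, untested against the live markup):

=IMPORTXML($A$1, "//p/b[starts-with(text(), '"& $A4 &"')]/following-sibling::a[1]")

IMPORTXML returns an element's text content, so this should yield the link text of the first <a> following the matching label. The catch is that a single value between two <br> tags can span several sibling nodes (a text node plus an <a>, say, which is why Casting Time only returned the 1), and stitching those back together in XPath 1.0 is awkward; that is what the answer further down sidesteps by regex-parsing the raw source instead.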
Super in-depth clarification section
Using XPath and a Google Sheet, I am attempting to automatically build a Roll20-formatted template macro for each spell on a spellcaster's list.
For example, on the Shaman Spell List I used //tr/td[1]/a[@href] and //tr/td[1]/a/@href to create side-by-side columns of spell names and their associated URLs.
Then on another page I can copy and paste the entire class spell list and use VLOOKUP to get the associated URLs while keeping the organized, level-sectioned tables (note: the hyperlinked spell names are rich text, so the internal URL is invisible to IMPORTXML, hence the extra step).
With a single class having upwards of 500 spells, the ultimate goal is to create a series of IMPORTXML formulas that look at each spell's URL and pull relevant data from its stat-block section. For this example I'm using Arcane Mark.
The final goal is to use IMPORTXML to get each important category (School, Casting Time, Target, Effect, Area, Range, etc.) into its respective column, then have a CONCATENATE formula I've written pull all the various parts into one big formatted string compatible with the Roll20 macro template, like: &{template:default} {{Name=Arcane mark}} {{School=Universal}} {{Casting Time=1 Standard Action}} {{Components=V,S}} {{Range=Touch}} {{Effect=One personal rune or mark, all of which must fit within 1 sq. ft.}} {{Duration=Permanent}} {{Saving Throw=None}} {{Spell Resistance=No}}
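For illustration, a minimal sketch of that concatenation step, assuming the category values land in A2:I2 in the order below (the column layout is hypothetical):

="&{template:default} {{Name="&A2&"}} {{School="&B2&"}} {{Casting Time="&C2&"}} {{Components="&D2&"}} {{Range="&E2&"}} {{Effect="&F2&"}} {{Duration="&G2&"}} {{Saving Throw="&H2&"}} {{Spell Resistance="&I2&"}}"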

=ARRAYFORMULA(REGEXEXTRACT(TRANSPOSE(QUERY(TRANSPOSE(QUERY(ARRAY_CONSTRAIN(
IMPORTDATA("http://www.d20pfsrd.com/magic/all-spells/a/arcane-mark"),1000,5),
"where Col1 contains 'School'", 0)),,999^99)), A10&"\</b>\ (.+)\;"))

Related

Octoparse and relative XPath iframe extraction issues

I am trying to use Octoparse to extract the podcast details from Marie Brown's "Beyond the kitchen table" website: https://beyondthekitchentable.co.uk/podcast/
I'm using Octoparse's free version, which allows for scraping locally. Octoparse will auto-detect the Title, Title_URL, and Content webpage data and correctly set up the Pagination, Scroll Page, and Loop Item workflow to extract those fields, but it does not auto-detect the Date and Podcast time duration fields of each individual podcast, as these appear to be embedded via an iframe. I am able to add Date and Podcast time duration manually using an absolute XPath, i.e. //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1], but this results in the same value being copied into every record. When I attempt to fix this by using Octoparse's Relative XPath setting to loop over each item with //span[@class="cp-episode-date"], it gets no values at all, even though that same relative XPath finds all occurrences when I search in Chrome's DevTools. I saw what might be another helpful post on Stack Exchange about this but was not able to make sense of it.
//span[@class="cp-episode-date"] is a relative XPath, since it finds multiple Date items in Chrome DevTools, but it is incomplete: I am not sure how to express the iframe traversal that Octoparse's Relative XPath setting needs for the Date and Podcast time duration fields I added. I even tried the SelectorsHub Chrome browser extension, but it didn't bring up the nested query view the SelectorsHub YouTube video demonstrates; it only showed me the relative XPath I already have.
Please have a look at this site using Octoparse and see if it is possible. If so, how can I do it?
When an absolute XPath is used: //div[@class="cfm-episodes-list"]/div[1]/div[2]/div[1]/iframe[1]
vs.
When a relative XPath is used: //span[@class="cp-episode-date"]
There are plenty of iframes inside that webpage, and I don't know whether Octoparse can handle them. Choose another starting point.
For example, use Apple Podcasts:
https://podcasts.apple.com/gb/podcast/the-website-coach/id1587503231
Dates can be recovered with the following XPath:
//div[@class="l-row"]//time[@class]/@aria-label
Another possibility is to scrape the following page:
https://feeds.captivate.fm/the-website-coach/
Dates can be recovered with the following XPath:
//h4/text()
Even easier, get the data directly from this URL (a .json file):
https://itunes.apple.com/lookup?id=1587503231&media=podcast&entity=podcastEpisode&limit=100
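To illustrate that last option, a minimal sketch in Python using only the standard library (the wrapperType, releaseDate, and trackName field names are what the iTunes Lookup API returned at the time of writing; verify against the live response):

import json
import urllib.request

URL = ("https://itunes.apple.com/lookup?id=1587503231"
       "&media=podcast&entity=podcastEpisode&limit=100")

# Fetch and parse the JSON feed of episodes.
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# The first result describes the podcast itself; episodes are marked
# with wrapperType == "podcastEpisode" and carry a releaseDate field.
for item in data["results"]:
    if item.get("wrapperType") == "podcastEpisode":
        print(item["releaseDate"], "-", item["trackName"])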

Tool for extracting an XPath query from a specified/selected node

Normally, one would use an XPath query to obtain a certain value or node. In my case, I'm doing some web-scraping with google spreadsheets, using the importXML function to update automatically some values. Two examples are given below:
=importxml("http://www.creditagricoledtvm.com.br/";"(//td[#class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
The problem is that the pages I'm scraping change every now and then, and I understand very little about XML/XPath, so it takes a lot of trial and error to get to a node. I was wondering if there is any tool I could use to point at an element (either in the page or in its code) that would produce an appropriate query.
For example, in the second case I noticed the info I wanted was in a span node (hence (//span)), so I printed all of them in a spreadsheet and used the line count to find the [32] index. This takes long to load, so it's pretty inconvenient. Also, I don't even remember how I figured out the //td[@class='xl7825385'] query. That's why I'm wondering if there is a more practical method of pointing at page elements.
Some clues:
Learning XPath basics is still useful; W3Schools is a good starting point.
https://www.w3schools.com/xml/xpath_intro.asp
Otherwise, the built-in dev tools of your browser can generate an absolute XPath for you. Select an element, right-click it, then Copy > Copy XPath.
https://developers.google.com/web/tools/chrome-devtools/open
Browser extensions like ChroPath can generate absolute or relative XPaths for you.
https://autonomiq.io/chropath/
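As a quick illustration of the difference (both paths hypothetical): an absolute XPath like /html/body/div[2]/table/tbody/tr[5]/td[3] spells out every step from the document root and breaks as soon as the layout shifts, while a relative XPath like //td[@class='xl7825385'] anchors on a stable attribute and survives changes around it, which is why the relative form usually holds up better on pages that change every now and then.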

Google Sheets IMPORTXML Text Field from Website

I am trying to dynamically pull in car values for cars matching specific criteria on Kelley Blue Book. I have this IMPORTXML query that has a link to the specific page that shows the trade-in value of the car.
=IMPORTXML("https://www.kbb.com/Api/3.9.462.0/71553/vehicle/upa/PriceAdvisor/meter.svg?action=Get&intent=trade-in-sell&pricetype=FPP&zipcode=12345&vehicleid=411852&selectedoptions=6762567|true|6762674|false|6762900|false|6762905|false|6762909|false|6762913|false|6762915|true|6762926|false|6762928|false&hideMonthlyPayment=False&condition=verygood&mileage=40000", "//text[#y='-8']")
In the SVG this URL returns, there is a <text> element whose y attribute is -8. I was hoping that would be sufficient to identify the data I want to pull in (the trade-in value). I get the standard Can't fetch URL error and can't figure out why.
The issue is not within your XPath "//text[@y='-8']" but with the website itself.
Basically, you have two options to test whether a website can be scraped:
=IMPORTXML("URL", "//*")
where the XPath //* means "everything that's possible to scrape",
and the direct source-code scrape method:
=IMPORTDATA("URL")
Sometimes the source code is just huge and Google Sheets can't handle it, so it needs to be restricted a bit, like:
=ARRAY_CONSTRAIN(IMPORTDATA("URL"), 10000, 10)
Anyway, none of these can scrape anything from your URL.

Kofax: Separate Main Invoice from Supporting Document without using Separator sheet

When a batch gets created, documents should be separated automatically without using a separator sheet or barcode separator.
How can I classify documents as invoice versus supporting document?
In our project we get many invoices with supporting documents, so the scanning person has to insert the separator sheets manually; to avoid this we want to classify the supporting documents automatically.
In general the concept would be that you would enable separation in the project and then train your classes with examples to be used for the layout or content classifiers.
However, as I'm sure you've seen, the obstacle with invoices is that they are different enough between vendors that it would not reliably classify all to an Invoice class. Similarly with "Supporting Documents" which are likely to be very different from each other, so unfortunately there isn't a completely easy answer without separator sheets (or barcode stickers affixed to supporting docs).
What you might want to do is write code in one of the separation events, like the Document_AfterSeparate event. Despite the name, the document has not yet been split at this point, but the classifiers have run. See the Scripting Help topic "Server Script Events Sequence > Document Separation > Standard Document Separation" for more detail. Setting the SplitPage property on the CDocPage (pXDoc.CDoc.Pages.ItemByIndex(lPage).SplitPage) will allow you to use your own logic to determine which pages to separate.
For example if you know that you will always have single page invoices, you can split on the first page and classify accordingly. Or you can try to search for something that indicates the end of the invoice like "Total" or other characteristics. There is an example of how you can use locators to help separation in the Scripting Help topic "Script Samples > Use Locator Results for Standard Document Separation". The example uses a Barcode Locator, but the same concept works if you wanted to try it with a Format Locator or anything else.
Without separator sheets you will need smart classification software like Kofax Transformation Modules (KTM). It's kind of expensive, so you will need to verify the cost savings and ROI.

How do I take each line of a text file and insert it into a web form? Specifically, for testing domain name availability

I wrote a Ruby script that appended "data" to the beginning of every word in the English dictionary and then filtered out various strings using different parameters. Now I want to use a site like Namecheap or Gandi.net to run each of these strings through the domain name availability checker and determine which ones are available.
It is my understanding that this will involve making a POST HTTP request of some kind, as well as grabbing the element in question, but I don't really understand what I should read about in order to do this kind of thing.
I imagine that after a few requests I will be rate-limited, but as a learning exercise I am still curious how I would go about doing this.
I inspected the element (on Namecheap) to see what the tag looked like and to find any uniquely identifiable class/id names I could use to grab that specific part of the source, and found that inside a fieldset tag there was a line of HTML that I can't seem to paste here, so here is a picture:
Thanks in advance for any guidance in helping me learn about web scripting!
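As a sketch of one way to do the availability check without scraping a web form at all, the snippet below queries the .com WHOIS service directly over its plain-text port-43 protocol (the words.txt filename is a stand-in for the script's output file, and real WHOIS servers rate-limit aggressively):

import socket

def com_available(domain: str) -> bool:
    # WHOIS is a plain-text protocol: connect to port 43, send the query,
    # and read until the server closes the connection.
    with socket.create_connection(("whois.verisign-grs.com", 43), timeout=10) as s:
        s.sendall(f"domain {domain}\r\n".encode())
        response = b""
        while chunk := s.recv(4096):
            response += chunk
    # Verisign answers "No match for ..." when a .com name is unregistered.
    return b"No match for" in response

with open("words.txt") as f:  # one candidate word per line
    for word in f:
        name = f"data{word.strip()}.com"
        print(name, "available" if com_available(name) else "taken")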
