xpath in =importXML() for extracting meta descriptions - xpath

I'm trying to use XPath to pull in the meta descriptions from web pages using Google Sheets.
I have this working to pull in the titles: =importXml(www.example.com; "//title")
Here are two sources of my learning:
http://seogadget.co.uk/playing-around-with-importxml-in-google-spreadsheets/
http://docs.google.com/support/bin/answer.py?hl=en&answer=75507
I have read many other posts on this site, and these seem close to what I want:
"/html/head/meta[@name='description']/@content"
"/*/head/meta[@name='description']/@content"
"//head/meta[@name=\"description\"]/@content"
None of these work in Google Sheets, which expects the query to be written in XPath. The only difference is that in Google Sheets you use ' in place of " inside the query (which is why description is quoted that way). I've honestly tried it about 219 different ways, with no luck.
Any ideas? Thanks in advance!

//meta[@name='description']/@content
So your full call in Google Sheets would be:
=importxml(A1,"//meta[@name='description']/@content")
I've built some awesome SEO tools using importXML - this is just the start of it mate! :)
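To see what that XPath selects outside of Sheets, here is a minimal Python sketch using only the standard library. The sample markup and the `meta_description` helper are made up for illustration. Note that `xml.etree.ElementTree` supports only a subset of XPath and cannot select attributes with `/@content`, so the sketch matches the element with the same predicate and reads the attribute in a second step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real-world HTML is often not valid XML).
SAMPLE = """<html>
  <head>
    <title>Example page</title>
    <meta name="keywords" content="a, b"/>
    <meta name="description" content="A short summary of the page."/>
  </head>
  <body><p>Hello</p></body>
</html>"""


def meta_description(markup):
    """Return the content of <meta name="description">, or None if absent."""
    root = ET.fromstring(markup)
    # Same predicate as in the Sheets formula: [@name='description'].
    node = root.find(".//meta[@name='description']")
    # ElementTree has no /@content step, so read the attribute directly.
    return None if node is None else node.get("content")


print(meta_description(SAMPLE))
```

The predicate picks the one `meta` element whose `name` attribute equals `description`, and the `content` attribute is what the Sheets formula returns via `/@content`.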

Related

Google Sheets ImportXML Error: Imported XML content cannot be parsed

I'm trying to import stock suggestions into a spreadsheet and have successfully done so from 3 websites, but I am struggling with the 4th.
=IMPORTXML("https://www.tipranks.com/stocks/amzn/price-target","//h3[@class='client-components-stock-research-analysts-analyst-consensus-style__buy']")
Is the command I'm using to try to pull the tag under "Analyst Consensus Rating" from https://www.tipranks.com/stocks/amzn/price-target. But I keep getting:
Error: Imported XML content cannot be parsed.
Tips on what I'm doing wrong would be highly appreciated.
Google Sheets does not support web scraping of JavaScript-controlled elements. You can easily check this by disabling JS for a given site: only what is left visible can be scraped. Unfortunately, in your case, that's nothing.
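The same check can be approximated in code: look at the raw, unrendered markup and see whether the class you are targeting is there at all. The sketch below is a rough illustration in standard-library Python. The `ClassCollector` and `class_in_static_html` names and the sample markup are hypothetical, and in practice you would feed in the page source as downloaded without running any scripts.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name present in the raw (unrendered) markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def class_in_static_html(markup, class_name):
    """Return True if class_name appears on any element in the markup."""
    collector = ClassCollector()
    collector.feed(markup)
    return class_name in collector.classes


# Hypothetical source of a script-rendered page: only an empty mount point.
STATIC_HTML = (
    '<html><body><div id="root" class="app"></div>'
    '<script src="bundle.js"></script></body></html>'
)

print(class_in_static_html(STATIC_HTML, "app"))
print(class_in_static_html(STATIC_HTML, "analyst-consensus"))
```

If the class is missing from the static source, an XPath that depends on it has nothing to match, whatever the formula looks like.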

IMPORTHTML or IMPORTXML to collect data from a site

I have made several attempts to collect the data within this table:
I have tried the simple forms of the two functions mentioned in the title, but without success.
I would like to know if anyone knows another way to collect this data in Google Sheets.
Site Link:
https://www.onlinebettingacademy.com/stats/team/brazil/operrio-pr/13217#tab=t_squad
The table you want to scrape is under JavaScript control, so it can't be scraped.
All you can get from that site into Google Sheets is:
=ARRAY_CONSTRAIN(IMPORTDATA(
"https://www.onlinebettingacademy.com/stats/team/brazil/operrio-pr/13217#tab=t_squad&team_id=13217"); 10000; 10)
Because the page you are trying to scrape is rendered using JavaScript (that is, the content you are looking for is not in the markup), you will not be able to use a tool like Google Sheets.
However, you can scrape this by using a "headless browser": a browser without a UI that loads your URL and runs its JavaScript. Once the page has rendered, you query the data using XPath or similar.
Check out Puppeteer for an example of a JS framework that you can use for this task.

XPath scrape using google sheets

I have been struggling to get any XPath technique to work in Octoparse and similar software. After reading posts here I'm now trying Google Sheets, and can't get it to work there either.
Input: A SlideShare presentation URL (e.g. https://www.slideshare.net/carologic/ai-and-machine-learning-demystified-by-carol-smith-at-midwest-ux-2017)
Intended output: Slideshare embed url (in this case: https://www.slideshare.net/slideshow/embed_code/key/wZudqqTdctjWXA)
I think this would be the way to get the output using Google Sheets: =importxml(A1,"//meta[@itemprop='embedURL']/@content")
It is not working for me (failure to fetch url). With Octoparse etc I just got a blank value.
I'm being daft here, no doubt. Any help would be useful.
It doesn't work because SlideShare is owned by LinkedIn, and they have put a lot of effort into ensuring they can't be scraped, including by Google Sheets. It used to be possible, but I believe they eventually caught on to the workaround.

Google Site: list template + local spreadsheet/database

I am building a couple of "list" (based on the page template) pages on a google site. This is all great. The thing though is that I want to have a local copy (spreadsheet) as a backup or for offline use.
My first question is: can I somehow download the list as a spreadsheet?
Moreover, it would be much better if I could use the nice functionality of the list template (i.e. one simple form to enter all data for an entry, which I already use in my list template, including drop-down lists, as well as the ability to sort by columns) and at the same time be able to download a copy of the list or have it in my Drive.
Is that possible? and how?
thanks
Yes you can... sort of. It is possible to do some basic web scraping in Google Sheets with the functions:
IMPORTHTML
IMPORTXML
Between the two, IMPORTXML works well for getting specific data, while IMPORTHTML is perfect for grabbing tables (see "Using Google Sheets as a basic web scraper").
From there you can download the Google Sheet with the list, keep a copy in your Google Drive as normal, and even export it as a PDF or Excel file.

using Xpath to get a href from html

First time posting here and a newbie at Google Apps. I am putting together a URL in a spreadsheet for a LinkedIn company. Example: http://www.linkedin.com/company/National-Renewable-Energy-Laboratory
Can I use =importXML from a Google spreadsheet plus XPath to get the website URL that is listed on each company page?
I have gotten to a point where I can extract all the hrefs from the page, and the link that I need is among them, but I just want the website URL.
Here is what I am using so far:
=importXML(R2, "//*[@href]")
Here is a link to my spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AheVK6uxf6AvdHhILTFrR1k4Wl9tWW5OVWpRRUJKMlE
The code is in S2
Appreciate your response.
//*[@href] matches elements that have an href, not the href attributes themselves. Try //@href instead.
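The difference between selecting elements and selecting attribute values can be sketched in standard-library Python. The sample markup is made up for illustration. `xml.etree.ElementTree` only supports the element form (`.//*[@href]`), so the attribute values that `//@href` would return are read from the matched elements afterwards. The final `startswith("http")` filter is just one assumed way of narrowing down to absolute links and is not part of the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
SAMPLE = """<html><body>
  <a href="/about">About</a>
  <p>No link here</p>
  <a href="http://www.example.org/">Website</a>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Equivalent of //*[@href]: the ELEMENTS that carry an href attribute.
elements = root.findall(".//*[@href]")

# Equivalent of what //@href yields: the attribute VALUES themselves.
hrefs = [el.get("href") for el in elements]

# Assumed refinement: keep only absolute links.
external = [h for h in hrefs if h.startswith("http")]

print([el.tag for el in elements])
print(hrefs)
print(external)
```

In a spreadsheet, the element form returns each matching element's text content, which is why the attribute form is what you want when you need the URLs.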
It's more complicated, but a good solution would be to use the LinkedIn API, which you can access using UrlFetchApp.

Resources