ImportXML Google Spreadsheet parsing formula error - XPath

I tried to use this formula
=ImportXML("http://www.google.com/search?q=philadelphia seo company&num=100", "//h3[@class='r']/a/@href")
from http://www.seerinteractive.com/blog/importxml-cookbook/
and I get a formula error. Do you need to enable something in Google Spreadsheets before using this formula?

You need to encode the search query part, "q=philadelphia seo company", meaning all the spaces should be converted to "%20".
The end result should look like this:
=ImportXML("http://www.google.com/search?q=philadelphia%20seo%20company&num=100", "//h3[@class='r']/a/@href")
Also - I use ImportXML all the time, and with Google search results you can also use "//cite", depending on how much of the URL you want.
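If you'd rather not escape the spaces by hand, a minimal sketch using the built-in ENCODEURL function should produce the same encoded URL (the query string literal here is just the one from your example):
=ImportXML("http://www.google.com/search?q="&ENCODEURL("philadelphia seo company")&"&num=100", "//h3[@class='r']/a/@href")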

Related

Google Sheet Formula To Extract Domain From Different Format Site URLs

I have a Google spreadsheet where I have lists of URLs. I have a formula to extract the domains from the URLs. But the issue is when a URL has multiple names in the domain. For example
I have attached the link to a sample doc and the two formulas that I tried. These two formulas work perfectly for certain formats but not for some other cases. If there is a way to combine these two, or some way to detect the URL format and choose the best formula to extract the domain, that would be good. I tried but couldn't achieve the desired output. The Google Sheet link is given below.
Sample google sheet
You can get by with just one formula: REGEXEXTRACT.
First, we extract the hostname from the URL. To do that, we use the following formula:
=REGEXEXTRACT(A2:A,"(?:www\.)?([\w._\-]{6,})")
Now, we extract the domain from the hostname. You can do it like this:
=REGEXEXTRACT(...hostname... ,"[\w_\-]+\.\w{0,4}\.?\w{0,4}$")
And now we build everything into a single array formula:
=ARRAYFORMULA(if(A2:A<>"",REGEXEXTRACT(REGEXEXTRACT(A2:A,"(?:www\.)?([\w._\-]{6,})"),"[\w_\-]+\.\w{0,4}\.?\w{0,4}$"),))
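For illustration, with a made-up URL like "https://blog.example.co.uk/page" (not taken from your sample sheet), the inner REGEXEXTRACT returns the hostname "blog.example.co.uk" and the outer one then reduces it to "example.co.uk":
=REGEXEXTRACT(REGEXEXTRACT("https://blog.example.co.uk/page","(?:www\.)?([\w._\-]{6,})"),"[\w_\-]+\.\w{0,4}\.?\w{0,4}$")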
I don't claim this is the best solution to your task - someone will probably be able to suggest something simpler.

Incorrect URL/XPath when using Google sheets IMPORTXML

I'm trying to import a search result from google to my spreadsheet. I've had success with Wikipedia pages, but for some reason, Google search isn't working correctly (giving a "could not fetch url" error). I'm sure the problem is somewhere in my URL or XPath, but I've been trying a variety of things and I'm lost. Here is what I've got:
=IMPORTXML("https://www.google.com/search?q=dom+fera+easy+thing+released", "//div[@class='Z0LcW XcVN5d']")
I'm linking the spreadsheet below as view-only for reference as well. Ultimately the goal is to be able to webscrape release years of songs. I'd appreciate any help!
https://docs.google.com/spreadsheets/d/1bt8MJ23nfGAv6ianaR-sd7DM5DNn98p7zWSG1UzBlEY/edit?usp=sharing
AFAIK, you can't parse results from Google Search in Google Sheets.
Using Discogs, MusicBrainz, All Music... to get the release dates could be useful.
But it seems some of your groups are little known, so you can use YouTube to fetch the dates.
Note: we assume the year of publication on YouTube corresponds to the year of release.
Of course, that's not 100% true. For example, artists can publish their video clip months after the release, or publish nothing on YouTube at all.
So this method will work with a wide range of songs, but not ALL songs. With recent bands and songs, it should be OK.
To do this you can use the YouTube API or IMPORTXML formulas. In both cases, we always take the first result (relevance order) of the search engine as the source.
You need an API key and an ImportJSON script (credits to Brad Jasper) to use the API method. Once you have installed the script and activated your API key, you can paste in cell B3:
="https://www.googleapis.com/youtube/v3/search?key={yourAPIKey}&part=snippet&type=video&filter=items&regionCode=FR&q="&ENCODEURL(A3)
We generate the URL to query from the content you input in column A.
We use "regionCode=FR" since some songs are not available in the US ("i need you FMLYBND"). That way we get the correct release date.
In C3, you can paste:
=LEFT(QUERY(ImportJSON(B3);"SELECT Col11 LIMIT 1 label Col11''";1);4)
We parse the JSON, select the column of interest, the line of interest, then we clean the result.
With the IMPORTXML method, you can paste in E3:
="https://www.youtube.com"&IMPORTXML("https://www.youtube.com/results?search_query="&A3;"(//div[@class='yt-lockup-thumbnail contains-addto'])[3]/a/@href")
We construct the URL from the first search result of the search engine.
In F3, you can paste:
=LEFT(IMPORTXML(E3;"//meta[@itemprop='datePublished']/@content");4)
We parse the previously built URL, then extract the year of publication.
As you can see, there's a difference in the results on line 5. That's because the song is not available in the US. The first result returned in the IMPORTXML method is different from the one of the API method which uses a "FR" flag.
Side note: I'm based in Europe, so the ";" in the formulas should be replaced with ",".
Google does not support web scraping of Google Search results into Google Sheets. This option was disabled two years ago. You will need to use an alternative search engine.
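As a hedged sketch of one alternative: Google's official Custom Search JSON API can be read with the same ImportJSON script mentioned in the answer above, assuming you have created an API key and a custom search engine ID (the cell references A1/B1 and the {placeholders} are only illustrative):
="https://www.googleapis.com/customsearch/v1?key={yourAPIKey}&cx={yourSearchEngineID}&q="&ENCODEURL(A1)
=ImportJSON(B1)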

Syntax for scraping double quotes in rapidminer (XPATH)

I'm having trouble using XPath in RapidMiner when trying to retrieve reviews from the Google Play store. The problem seems to be that these reviews are in double quotes, and I can't get RapidMiner to spit out the text... only blanks. I have a number of other XPath queries that are working fine for other commands where I use divs and spans, etc. I'm able to get things to work in a Google spreadsheet for this query through =importXML, but not in RapidMiner.
This is what I have in XPATH:
//*[@class='review-text']
So I added a /text() to the end and still nothing. I have played around with adding //div instead of //* and have also used h:/span. I'm kind of hoping there's a special syntax for retrieving quotes that I'm unaware of?
Here is the HTML I'm looking to scrape in the image below:
https://i.stack.imgur.com/dl6I8.png
Please see my comment below on further failed tests. Thanks.
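For reference, the Google Sheets formula described as working would look roughly like this, with {appId} standing in for the app's Play Store ID:
=IMPORTXML("https://play.google.com/store/apps/details?id={appId}", "//*[@class='review-text']")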

Pasting constantly updating xpath into google sheets

I'm pretty new to this and trying to pull certain XPath data from a website into Sheets.
Url: "https://www.btcmarkets.net/"
XPath (from Chrome's copy XPath function): //*[@id="LastPriceAUDBTC"]
I keep getting
formula parse error
I have managed to get the table headings with:
Xpath: "//tr"
but not the information within
Is this even possible?
I know the google finance add-ons but I am analyzing the difference in prices of different exchanges.
QUERY #2
I would also like to
=importxml("http://www.xe.com/currencyconverter/convert/?Amount=1&From=EUR&To=CAD","//*[@id="ucc-container"]/span[2]/span[2]")
Should I be using =importDATA and shaving off what I don't want?
You need to use double quotes around the entire XPath but single quotes around the class name/ID name/attribute value:
"//*[#id='LastPriceAUDBTC']"
And
=importxml("http://www.xe.com/currencyconverter/convert/?Amount=1&From=EUR&To=CAD","//*[@id='ucc-container']/span[2]/span[2]")

How to Scrape a Website Using Google Spreadsheet?

I have this website https://gpfo.memberclicks.net//index.php?option=com_community&view=profile&userid=23705974 and I am trying to extract the href link behind 'View' under 'Full Profile'.
I'd like to know how to scrape this. I tried //dl[1]/dd[contains(a/text(),'View')]/@href but it didn't return any data.
I'd also like to get an expert opinion on what the most efficient way to scrape websites is: is it better to directly run importXML from Google Docs or is there a better way to doing it using Scripts?
You're trying to query for the <dd>'s @href attribute (which is not present). Try
//dd/a[. = 'View']/@href
instead. Or, staying closer to your original expression:
//dl[1]/dd/a[contains(text(),'View')]/@href
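In a Sheets cell that would look something like this, using the profile URL from your question:
=IMPORTXML("https://gpfo.memberclicks.net//index.php?option=com_community&view=profile&userid=23705974", "//dd/a[. = 'View']/@href")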
Is it better to directly run importXML from Google Docs or is there a better way to doing it using Scripts?
Depends on how complex things will get. If you just want to read some tabular data, you're probably better off with plain Spreadsheets; if it is more complicated, writing your own script might be reasonable.
