Incorrect URL/XPath when using Google sheets IMPORTXML - xpath

I'm trying to import a search result from Google into my spreadsheet. I've had success with Wikipedia pages, but for some reason, Google search isn't working correctly (giving a "could not fetch url" error). I'm sure the problem is somewhere in my URL or XPath, but I've been trying a variety of things and I'm lost. Here is what I've got:
=IMPORTXML("https://www.google.com/search?q=dom+fera+easy+thing+released", "//div[@class='Z0LcW XcVN5d']")
I'm linking the spreadsheet below as view-only for reference as well. Ultimately the goal is to be able to webscrape release years of songs. I'd appreciate any help!
https://docs.google.com/spreadsheets/d/1bt8MJ23nfGAv6ianaR-sd7DM5DNn98p7zWSG1UzBlEY/edit?usp=sharing

AFAIK, you can't parse Google Search results in Google Sheets.
Discogs, MusicBrainz, AllMusic and similar databases could be useful for getting release dates.
But some of your artists seem to be little known, so you can use YouTube to fetch the dates instead.
Note: we assume the year of publication on YouTube corresponds to the year of release.
Of course, that's not 100% true. For example, artists can publish their video months after the release, or publish nothing on YouTube at all.
So this method will work for a wide range of songs, but not all of them. With recent bands and songs, it should be OK.
To do this, you can use the YouTube API or IMPORTXML formulas. In both cases, we always take the first search result (sorted by relevance) as the source.
You need an API key and an ImportJSON script (credits to Brad Jasper) to use the API method. Once you have installed the script and activated your API key, you can paste in cell B3:
="https://www.googleapis.com/youtube/v3/search?key={yourAPIKey}&part=snippet&type=video&filter=items&regionCode=FR&q="&ENCODEURL(A3)
We generate the URL to query from the content you enter in column A.
We use "regionCode=FR" since some songs are not available in the US ("i need you FMLYBND"). That way we get the correct release date.
In C3, you can paste:
=LEFT(QUERY(ImportJSON(B3);"SELECT Col11 LIMIT 1 label Col11''";1);4)
We parse the JSON, select the column of interest (Col11), keep only the first row, then clean the result by keeping its first four characters (the year).
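For readers who want to check the same logic outside of Sheets, here is a minimal Python sketch of the API method. It assumes the search response contains an "items" list whose entries carry a snippet.publishedAt timestamp; the function names and the sample response are made up for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(query, api_key, region_code="FR"):
    """Build the search URL, like the formula in B3 does (hypothetical helper)."""
    params = {
        "key": api_key,
        "part": "snippet",
        "type": "video",
        "regionCode": region_code,
        "q": query,
    }
    # urlencode escapes the query text, similar to ENCODEURL in Sheets.
    return API_ENDPOINT + "?" + urlencode(params)


def first_result_year(response_text):
    """Return the publication year of the first result, or None if there is none."""
    data = json.loads(response_text)
    items = data.get("items", [])
    if not items:
        return None
    # publishedAt is an ISO 8601 timestamp, so the year is the first 4 characters.
    return items[0]["snippet"]["publishedAt"][:4]


# Made-up response for illustration only (not real API output).
sample_response = json.dumps(
    {"items": [{"snippet": {"publishedAt": "2019-03-15T12:00:00Z"}}]}
)

print(build_search_url("dom fera easy thing", "YOUR_API_KEY"))
print(first_result_year(sample_response))
```

Taking only the first item mirrors the "LIMIT 1" in the QUERY, and the slice mirrors LEFT(...;4).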
With the IMPORTXML method, you can paste in E3 :
="https://www.youtube.com"&IMPORTXML("https://www.youtube.com/results?search_query="&A3;"(//div[@class='yt-lockup-thumbnail contains-addto'])[3]/a/@href")
We construct the URL of the first result returned by the search engine.
In F3, you can paste :
=LEFT(IMPORTXML(E3;"//meta[@itemprop='datePublished']/@content");4)
We parse the previously built url, then we extract the year of publication.
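The extraction step of the IMPORTXML method can be sketched in Python as well. This is only an illustration under the assumption that the video page contains a meta tag with itemprop="datePublished"; the class name, function name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class DatePublishedParser(HTMLParser):
    """Remembers the content of the first <meta itemprop="datePublished"> tag."""

    def __init__(self):
        super().__init__()
        self.date_published = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.date_published is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("itemprop") == "datePublished":
            self.date_published = attr_map.get("content")


def publication_year(html):
    """Return the first 4 characters of datePublished, or None if it is absent."""
    parser = DatePublishedParser()
    parser.feed(html)
    if not parser.date_published:
        return None
    return parser.date_published[:4]


# Made-up page fragment for illustration only.
sample_html = (
    '<html><head><meta itemprop="datePublished" content="2018-06-01">'
    "</head><body></body></html>"
)

print(publication_year(sample_html))
```

The attribute lookup plays the role of the XPath //meta[@itemprop='datePublished']/@content, and the slice plays the role of LEFT(...;4).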
As you can see, the results differ on line 5. That's because the song is not available in the US: the first result returned by the IMPORTXML method differs from the one returned by the API method, which uses the "FR" region code.
Side note: I'm based in Europe, so my formulas use ";" as the argument separator. If your spreadsheet locale uses "," instead, replace ";" with ",".

Google does not support scraping Google Search results into Google Sheets. This option was disabled 2 years ago. You will need to use an alternative search engine.

Related

Morningstar xpath return empty in Google Sheet (Imported content is empty) [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I am trying to pull a number from the Morningstar "Cash Flow" page for an arbitrary stock ticker using XPath. I have tested the XPath on the Morningstar website with an XPath tester and it returned the desired values. However, when I use it in a Google Sheet, it returns #N/A (Imported content is empty.).
=IMPORTXML("http://financials.morningstar.com/cash-flow/cf.html?t=fb&region=usa&culture=en-US", "//div[@id='data_tts1']/div")
I did a bit of research on this and found out that the data on such websites is generated dynamically and the content is downloaded in stages. Therefore, the page needs to be loaded first to be able to pull any data out of it!
I'm wondering if there is any solution to this issue?
Your help would be much appreciated.
It's empty, as it should be, because the content you are trying to scrape is generated by JavaScript. Google Sheets does not support importing JS-rendered elements. You can always test this by disabling JS for a given site: only what's left can be scraped.
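The "disable JS and see what's left" test can also be approximated in code by looking at the HTML as delivered, before any script runs. The Python sketch below is a rough illustration: the helper names and both sample pages are made up, and it only checks whether the element with a given id already contains text in the static markup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementTextParser(HTMLParser):
    """Collects the text found inside the element with the given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # > 0 while we are inside the target element
        self.found = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def static_text(html, element_id):
    """Return the element's text in the raw HTML, or None if the element is missing."""
    parser = ElementTextParser(element_id)
    parser.feed(html)
    return "".join(parser.text).strip() if parser.found else None


# Made-up pages: what the server sends vs. what the browser shows after JS ran.
static_page = '<div id="data_tts1"></div><script>/* fills the div later */</script>'
rendered_page = '<div id="data_tts1"><div>1,234</div></div>'

print(repr(static_text(static_page, "data_tts1")))
print(repr(static_text(rendered_page, "data_tts1")))
```

An empty string for the static page is the situation described above: the element exists, but its content only appears once JavaScript has run, so IMPORTXML has nothing to import.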
It might be possible. But you have to prepare a custom sheet to extract the data. Use IMPORTDATA to parse the .json which contains the data :
http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=672024&callback=jsonp1585016592836&_=1585016593002
AFAIK, you can't directly import the .csv version (specific headers are needed, so curl or other specific tools would be required).
http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=764423&denominatorView=raw&number=3
Since this .json is very special (it contains HTML tags), I don't think a custom script for Google Sheets could import it correctly. So once the .json is loaded in Google Sheets, TRANSPOSE the rows to columns and use formulas to locate your data (target the cells which contain data_s1 and data_s2, for example). Use CONCAT to merge the cells of interest, then split the result into columns (use a custom separator). SEARCH for the data you want and clean the results with SUBSTITUTE. The method is dirty, but I think the whole process could be automated.
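Outside of Sheets, the same clean-up could be sketched in a few lines of Python. Everything about the payload shape here is an assumption: the "result" key, the sample values and the helper names are hypothetical, and the sketch only shows the general idea of unwrapping a JSONP callback and then pulling the text out of the embedded HTML.

```python
import json
import re
from html.parser import HTMLParser


def strip_jsonp(payload):
    """Remove a callback wrapper such as name({...}) and parse the JSON inside."""
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", payload, re.DOTALL)
    return json.loads(match.group(1) if match else payload)


class CellTextParser(HTMLParser):
    """Collects every non-empty piece of text found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.cells.append(text)


def cell_texts(html_fragment):
    parser = CellTextParser()
    parser.feed(html_fragment)
    return parser.cells


# Made-up payload: the key name "result" and the numbers are hypothetical.
sample = (
    'jsonp1585016592836({"result": '
    '"<div id=\\"data_s1\\"><div>100</div><div>250</div></div>"})'
)

data = strip_jsonp(sample)
values = cell_texts(data["result"])
print(values)
```

This replaces the TRANSPOSE / CONCAT / SPLIT / SUBSTITUTE chain with a JSON parse followed by an HTML parse, which tends to be less fragile than string surgery.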

Extracting Financial Statement Data Using Google Sheets + IMPORTXML [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I am trying to import, into google sheets, the last quarter's research and development expense for a few thousand companies from their financial statements. While I want to import several different elements from financial statements, the last quarter R&D expense is currently pertinent (and potentially the previous 3 quarters).
I have tried several different sites (Yahoo Finance, Bloomberg, etc.) but the simplest URL seems to be from stockrow.com because I can simply automate the substitution of the stock ticker in the URL.
To get the XPath, I inspect the element and copy the XPath using the browser (I have tried with Chrome and Firefox).
I am using IMPORTXML in Google Sheets and, on my last attempt, used the following input: =IMPORTXML("https://stockrow.com/JNJ/financials/income/quarterly","/html/body/div[1]/div/div/section/div/div[2]/div[1]/section[4]/div/div[3]/div/div/div[3]/div/div/div[11]/div/span")
I have attempted all sorts of combinations of sites, browsers, and XPaths related to the element, but no matter what I do, I always get the same error "Imported content is empty."
I read "xpath google sheet importxml" but can't make sense of what is happening in the change to the XPath or how to solve this particular challenge.
Because I want this to be repeatable across multiple stock tickers in Google Sheets, I am hoping that the "location" of the R&D expense (and other elements in the financial statement) is consistent across all pages, rather than getting just a specific solution to this challenge.
Looking forward to receiving guidance. Thanks!!
You need some other source. Google Sheets does not support scraping JavaScript-rendered elements. You can test the JS dependency simply by disabling JS for a given site; what's left can be scraped. In your case, that's nothing.

Google Sheets IMPORTXML Text Field from Website

I am trying to dynamically pull in car values for cars matching specific criteria on Kelley Blue Book. I have this IMPORTXML query that has a link to the specific page that shows the trade-in value of the car.
=IMPORTXML("https://www.kbb.com/Api/3.9.462.0/71553/vehicle/upa/PriceAdvisor/meter.svg?action=Get&intent=trade-in-sell&pricetype=FPP&zipcode=12345&vehicleid=411852&selectedoptions=6762567|true|6762674|false|6762900|false|6762905|false|6762909|false|6762913|false|6762915|true|6762926|false|6762928|false&hideMonthlyPayment=False&condition=verygood&mileage=40000", "//text[@y='-8']")
In this URL, there is a text field that has the y coordinate as -8. I was hoping that it would be sufficient to identify the data I want to pull in (The trade-in value). I get the standard Can't fetch URL error and can't figure out why.
The issue is not with your XPath "//text[@y='-8']" but with the website itself.
Basically, you have two options to test whether the website can be scraped:
=IMPORTXML("URL", "//*")
where the XPath //* means "everything that can be scraped",
and the direct source code scrape method:
=IMPORTDATA("URL")
Sometimes the source code is just huge and Google Sheets can't handle it, so the import needs to be restricted a bit, like:
=ARRAY_CONSTRAIN(IMPORTDATA("URL"), 10000, 10)
Anyway, none of these can scrape anything from your URL.

Google sheet XPath scraping: copy currency exchange rate

From this URL https://www.xe.com/currencyconverter/convert/?Amount=1&From=MYR&To=INR I want to copy the data into my google sheets.
in cell A1 I have https://www.xe.com/currencyconverter/convert/?Amount=1&From=MYR&To=INR
in cell A2 I have =IMPORTXML(A1,"//span[@class='converterresult-toAmount']")
I get #N/A as output.
Can someone advise me how?
Alternatively, you can use the GOOGLEFINANCE formula to fetch the FX rates from Google Finance directly:
=index(GOOGLEFINANCE("CURRENCY:MYRINR","price",today(),1,"DAILY"),2,2)
This function will return the daily FX rate for MYR-INR for today.
See GOOGLEFINANCE documentation for more details about the variations you can use to get more / different data.
I wrapped the Google Finance formula into an INDEX function to only get the rate (so you can use that in a multiplication to convert random amounts), as the GOOGLEFINANCE formula returns a table with dates and history by default.
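To make the INDEX step concrete, here is a tiny Python sketch of what happens to the returned table. The table contents are made up for illustration (they are not real rates), and the index helper is a hypothetical stand-in for the spreadsheet function.

```python
def index(table, row, column):
    """1-based lookup, mirroring the spreadsheet INDEX(range, row, column)."""
    return table[row - 1][column - 1]


# Made-up stand-in for the table returned by the formula: header row, then one data row.
sample_table = [
    ["Date", "Close"],
    ["2020-03-24 23:58:00", 17.25],
]

rate = index(sample_table, 2, 2)  # second row, second column: the rate itself
converted = 150 * rate            # convert an arbitrary amount

print(rate)
print(converted)
```

Row 1 holds the headers, so row 2, column 2 is the first actual rate, which is why the formula uses 2,2.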
Unfortunately, that won't be possible because the site is rendered by JavaScript, and Google Sheets can't import JS-generated content. You can test this simply by disabling JS for the given link: you will see a blank page.

How does Market Samurai and Long Tail Pro handle retrieving the top 10 Google search results for a keyword?

I'm curious to know how Market Samurai, Long Tail Pro and other software handle retrieving the top 10 Google search results without running into limits. It appears that these software packages use the user's own Google account. Google Custom Search limits users to 100 queries per day (the free limit), but people tend to do keyword research on hundreds or even thousands of keywords per day and don't pay any additional amounts to Google.
Are they paying extra for this service, are they using a different API (perhaps the Adwords API?) or are they scraping the Google search results page (violation of TOS)? Really would like to know! Thanks.
I have done this in one of my projects (in Java).
It is very simple: in Java there is a library called JSoup. Using this library, you can send a GET request to Google, for example:
https://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=<your url encoded search term>
This will return the HTML of the Google search results for your term.
Using JSoup, you can find specific HTML tags by class or id. This lets you extract the URL, title and description of each Google search result.
For a working example, check here; in that example you can extract Google search result links for a custom search term.
I hope this helps.
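The answer uses JSoup in Java; the same idea (parse the returned HTML, then pick out the links) can be sketched with Python's standard library. This is not the answerer's code: the helper names and the sample markup are hypothetical, the real result page markup is different and changes often, and no request is sent here.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus


def search_url(term):
    """Build a search URL with the term URL-encoded (hypothetical helper)."""
    return "https://www.google.com/search?q=" + quote_plus(term)


class LinkParser(HTMLParser):
    """Collects absolute links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("http"):
            self.links.append(href)


def extract_links(html, limit=10):
    """Return up to `limit` absolute links in document order."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links[:limit]


# Made-up result markup for illustration only.
sample_html = """
<div class="result"><a href="https://example.com/one">First</a></div>
<div class="result"><a href="/internal">Skipped relative link</a></div>
<div class="result"><a href="https://example.org/two">Second</a></div>
"""

print(search_url("long tail keywords"))
print(extract_links(sample_html))
```

Filtering on tag and attribute here is the simplest stand-in for JSoup's class and id selectors; titles and descriptions would need the same kind of handling for their own tags.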
