Getting an error trying to pull out text using Google Sheets and importxml() - xpath

I have a column of links in Google Sheets. I want to tell whether a page is producing an error message, using importxml.
As an example, this works fine:
=importxml("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T", "//td/b")
i.e. it looks for td elements and pulls out b (which contain the postal codes for Canada).
But this code that looks for the error message does not work:
=importxml("https://www.awwwards.com/error1/", "//div/h1" )
I want it to pull out the "THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST."
...on this page: https://www.awwwards.com/error1/
I'm getting a "Resource at URL not found" error. What could I be doing wrong? Thanks!

After quick trial and error with the default formulae:
=IMPORTXML("https://www.awwwards.com/error1/", "//*")
=IMPORTHTML("https://www.awwwards.com/error1/", "table", 1)
=IMPORTHTML("https://www.awwwards.com/error1/", "list", 1)
=IMPORTDATA("https://www.awwwards.com/error1/")
it seems that the website cannot be scraped in Google Sheets by any means (regular formulae).

You want to retrieve the value THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST. from the URL https://www.awwwards.com/error1/.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Issue and workaround:
I think that the page at your URL returns Error 404 (Not Found), so the status code 404 is returned. Because of this, the built-in functions like IMPORTXML cannot retrieve the HTML data.
So as one workaround, how about using a custom function with UrlFetchApp? When UrlFetchApp is used, the HTML data can be retrieved even when the status code is 404.
Sample script for custom function:
Please copy and paste the following script into the script editor of the spreadsheet, and put =SAMPLE("https://www.awwwards.com/error1") in a cell on the spreadsheet. This runs the script.
function SAMPLE(url) {
  return UrlFetchApp
    // muteHttpExceptions makes fetch() return the response body
    // instead of throwing when the server answers with a 404.
    .fetch(url, {muteHttpExceptions: true})
    .getContentText()
    // Pull the contents of the first <h1> element out of the raw HTML.
    .match(/<h1>([\w\s\S]+)<\/h1>/)[1]
    .toUpperCase();
}
Note:
This custom function is written for the URL https://www.awwwards.com/error1. When you use it for other URLs, it might not return the expected results. Please be careful with this.
References:
Custom Functions in Google Sheets
fetch(url, params)
muteHttpExceptions: If true the fetch doesn't throw an exception if the response code indicates failure, and instead returns the HTTPResponse. The default is false.
match()
toUpperCase()
If this was not the direction you want, I apologize.

Related

Receiving no translated text from an https request to translate.google.com

I tried to make something like a translator through Roblox Studio, using HttpService to send a request to the translate.google.com link. The thing is that nothing I get in return gives me the translated text.
I put what I received into a Google Doc and searched it with Ctrl+F, but no luck; the only thing I could find was the text that was supposed to be translated. Here is the code in case you want to try it for yourself, but I warn you that running this might make Roblox unresponsive for a while, since they give back a lot of info.
I don't know if I am doing something wrong or not; someone please help! I just want it to give me what 'Hello world' would be in French. There are also no error messages.
local http = game:GetService("HttpService")

local Message = "Hello world"
http:UrlEncode(Message) -- 'Hello world' -> 'Hello%20world'

local response = http:RequestAsync(
    {
        Url = "https://translate.google.com/?sl=en&tl=fr&text=" .. Message .. "!&op=translate";
        Method = "GET"
    }
)

if response.Success then
    print(response.StatusMessage)
    print(response.StatusCode)
    print(response.Body)
    --print(response.Headers)
else
    print("The request failed: ", response.StatusCode, response.StatusMessage)
end
When you visit (for example) the URL https://translate.google.com/?sl=en&tl=fr&text=Hello%20World!&op=translate in your browser, the translation you see is fetched by JavaScript code that the browser executes after loading the page.
The browser retrieves the HTML body of the page (like your code does) and then executes the JavaScript in it, which retrieves the translation and updates the page.
Unless you use a browser driver like Selenium, I don't see a simple way to do what you want.
Plus, I'm sure that Google has some protection against automated bots, so after too many requests your program will probably be blocked by reCAPTCHA.
The correct way to translate the text is to use the Google Cloud Translation API, which I think is free for up to 500k characters per month. There is also Azure Translator from Microsoft, which also has a free tier.
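If you do go the API route, here is a minimal sketch of a Cloud Translation v2 REST call, written in Python for brevity (in Roblox you would issue the equivalent POST with HttpService:RequestAsync and decode it with HttpService:JSONDecode); the API key is a placeholder you must supply yourself:
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: create your own key in the Cloud console

def translate(text, source="en", target="fr"):
    resp = requests.post(
        "https://translation.googleapis.com/language/translate/v2",
        params={"key": API_KEY},
        json={"q": text, "source": source, "target": target, "format": "text"},
    )
    resp.raise_for_status()
    # The v2 response nests results under data.translations
    return resp.json()["data"]["translations"][0]["translatedText"]

print(translate("Hello world"))  # -> "Bonjour le monde"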
Your issue is likely in how you are URL Encoding the string.
http:UrlEncode(Message)
HttpService.UrlEncode returns the encoded string as a new value; it doesn't mutate the existing value. So you just need to store the result of the function call:
Message = http:UrlEncode(Message)
EDIT: Just as @Mohamed AMAZIRH pointed out, hitting this URL will only return HTML.

Web Scraping Return Empty Value Using Xpath in Scrapy

I really need help from this community.
My question is that when I use the following code to extract the vendor name in the Scrapy shell, the output is empty:
response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()
I really don't know why that happens; it seems to me that the problem might be that the website info is updated dynamically.
The URL for this web scraping is https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the vendor name and the price for each vendor. The attached pic is a screenshot from the inspector.
Really appreciate the help!
You always need to check the HTML source code in your browser (usually with Ctrl+U).
This way you'll find that the information you want is embedded inside JavaScript variables as JSON:
var partnerPrices = [{"pool":"9a316391b6550eef969c8559c14a380f","partner":"ncl.com","priority":0,"currency":"USD","data":{"32860":{"2018-02-25":{"Inside":579,"Suite":1199,"Balcony":699,"Oceanview":629},....
var sponsored_partners = [{"code":"CDCNA","name":"cruises.com","value":"cruises.com","logo":"\/images\/partner-logo-cruises-sm.png","logo_sprite":"partner-logo-cruises-com"},...
So you need to import json, extract those strings from response.body (using re or another method), and then json.loads() the extracted JSON strings so you can iterate through the two arrays.
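As a rough sketch (the regex is my assumption about how those var declarations are terminated; the "partner", "data" and "name" keys come from the snippets quoted above), the extraction inside a Scrapy callback could look like this:
import json
import re

def extract_js_array(body, var_name):
    # Grab the JSON array assigned to "var <var_name> = [...];" in the page
    # source; assumes the array literal ends with "];"
    match = re.search(r"var\s+%s\s*=\s*(\[.*?\]);" % re.escape(var_name),
                      body, re.DOTALL)
    return json.loads(match.group(1)) if match else []

def parse(response):
    # response.text is the decoded HTML body inside a Scrapy callback
    partner_prices = extract_js_array(response.text, "partnerPrices")
    sponsored = extract_js_array(response.text, "sponsored_partners")
    for entry in partner_prices:
        yield {"vendor": entry["partner"], "prices": entry["data"]}
    for partner in sponsored:
        yield {"vendor": partner["name"]}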

Google Global Address List for Domain limited to 250 results [rehash]

Reposting due to lack of answers.
I'm trying to query the Google Global Address List for a specific domain, being led by this answer here (specifically the answer by Jay Lee).
It's all well and good and works perfectly in Google's OAuth Playground; however, it seems to be limited to 250 users. Given that this feature seems completely undocumented, and that I can't tell by looking at their GitHub repo (specifically this file), does anyone know how to query for the next 250 users, or how to set the number of results?
Thank you!
My last answer was converted into a comment, but I now have a complete answer to this question.
EDIT: I repeatedly mention JSON, but you don't need to parse the response as JSON to get the syncToken. Google will provide the syncToken regardless.
The Google GAL API seems to operate in a similar manner to the "nextPageToken" parameter of the Google Calendar or YouTube APIs, which allows you to query the next page of results as long as you have a token.
The Google GAL uses the "syncToken" parameter instead. Much like the other Google APIs, if you append this syncToken to the end of your URL, you will get the next page of results. Note that I was unable to get the startIndex parameter to work (which would allow you to begin at a specific item in the JSON), so if you are trying to get a specific result through the query, you will most likely have to parse the entire JSON file. Inside the parsed JSON there should be a key called "gal$syncToken"; you can find it right before the "entry" key/array, which is where all of the global domain contacts are listed.
Here is an example of what it would look like:
"gal$syncToken": {
"$t": "0_1001_17011_AAN3FFNH2AEJ3SFBKZDRS36AHRXHND47YACXLGFJF5QBXCSQLUK57EX3LC765CU5IWG6ZXWHPS5WKHSDFJ26LRI5FRIVIQ3Z532PWKG3ZG45JW3RVCDZMWK5LLLHZSCBTJH5U6Q4LZRG4PKWQE42AOIPC4VJCZQIP5MBJHNUBZZJNLKISKETTQ6DNTRAPTI"
},
Your syncToken will look different. To get the next page of results, Google actually provides a link with the syncToken already appended:
{
  "rel": "next",
  "type": "application/atom+xml",
  "href": "https://www.google.com/m8/feeds/gal/example.com/full?sync-token=0_1001_17011_AAN3FFNH2AEJ3SFBKZDRS36AYRXHND47YACXLGFJF5QBXCSQLUK57EX3LC765CU5IWG6ZXWHPS5WKHSDFJ26LRI5FRIVIQ3Z562PWKG3ZG45JW3RVCDZMWK5LLLH6SCBTJH5U6Q4LZRG4PKWQE42AOIPC4VJCZQIP5MBJHNUBZZJNLKBSKETTQ6DNTRAPTI"
},
However, this will only give you the next 250 results. If you want more results per page, you can append the max-results parameter and have Google return the feed in JSON format:
https://www.google.com/m8/feeds/gal/example.com/full?sync-token=INSERT_SYNCTOKEN_HERE&alt=json&max-results=10000
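Putting it together, a paging loop could look like the following sketch (the feed/entry layout follows the GData-style JSON quoted above; the access token is a placeholder, and I'm assuming the token stops being returned on the last page):
import requests

ACCESS_TOKEN = "YOUR_OAUTH_TOKEN"  # placeholder OAuth 2.0 access token
BASE = "https://www.google.com/m8/feeds/gal/example.com/full"

def fetch_page(sync_token=None):
    params = {"alt": "json", "max-results": 250}
    if sync_token:
        params["sync-token"] = sync_token
    resp = requests.get(BASE, params=params,
                        headers={"Authorization": "Bearer " + ACCESS_TOKEN})
    resp.raise_for_status()
    return resp.json()["feed"]

feed = fetch_page()
while True:
    for entry in feed.get("entry", []):
        print(entry)  # one global-address-list contact per entry
    # Assumption: Google stops returning gal$syncToken after the last page
    token = feed.get("gal$syncToken", {}).get("$t")
    if not token:
        break
    feed = fetch_page(token)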
Hope this helps anyone seeing this. The Google GAL seems to be largely undocumented, and I was unable to find an answer to this question anywhere else on the internet. If anyone from Google can confirm that this is the best method for accomplishing this, that would be great!

Extract value from javascript object in site using xpath and import.io

I want to extract a number provided by a JavaScript object on a site, but I really don't understand what I am doing.
I tried different versions using similar examples and guidelines from the import.io site and other tutorial sites, but I got only one of two results: either I extracted all numbers on the given page, or nothing at all.
I tried e.g. //*[contains(.,"Unikālo apmeklējumu skaits:")]/@type and //*[contains(.,"Unikālo apmeklējumu skaits:")]. Most likely it's necessary to add something else there, but I just don't know what.
The link I am interested in extracting from is https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html, and the information necessary is the number after the text "Unikālo apmeklējumu skaits:" ("unique visit count"), which is generated by JavaScript.
Hopefully someone will be able to help me with this problem.
For someone who is new to web scraping this can be a hard task, so I'll try to explain it. First of all, the XPath to get to that location could be something like this:
'//td[#class="msg_footer" and contains(text(), "Unik")]'
Now you have that tag (and what it contains), but if you check it, it doesn't contain the number you need; that content is dynamically loaded by the following JavaScript:
<script type="text/javascript"><!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
--></script>
which can be extracted from the response with this XPath:
'//script[contains(text(), "contacts_js")]/text()'
From that string you should replicate the URL that comes in src, so this URL for example:
/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
and append the current date, since the JavaScript creates it with new Date(). Then you should make a request to that URL (prepending the domain of the previous response), so something like:
https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
Note that the date is URL-encoded. The request should return a response like:
var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;
pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");
where you can check that the value of SHOW_CNT is the number you want.
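A sketch of that whole flow with requests and lxml (the regexes are my assumptions based on the snippets quoted above, and the date string only approximates what JavaScript's new Date() produces):
import re
import time

import requests
from lxml import html

page_url = "https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html"
tree = html.fromstring(requests.get(page_url).content)

# The inline script that document.write()s the contacts_js tag
script = tree.xpath('//script[contains(text(), "contacts_js")]/text()')[0]
src = re.search(r'src="([^"]+)\?t=', script).group(1)

# Approximate JavaScript's new Date() string, e.g. "Wed Oct 28 2015 20:56:42 GMT-0500"
js_date = time.strftime("%a %b %d %Y %H:%M:%S GMT%z")
contacts = requests.get("https://www.ss.lv" + src, params={"t": js_date})

# The counter is assigned to SHOW_CNT inside the returned JavaScript
show_cnt = re.search(r"SHOW_CNT=(-?\d+)", contacts.text).group(1)
print(show_cnt)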
If you want to know how I figured out which request and which script populated that tag: I used Firebug, searched for SHOW_CNT inside all of the responses involved in calling your URL, which pointed to the request I specified, and then checked what was requesting it.
Hope it helped.
support@import.io are the guys to speak to; they give free advice and help troubleshoot problems just like this all the time.
There are all kinds of tips and tricks you can use... for example, import.io provides an (undocumented beta) JavaScript pre-render service that would likely work for you in this scenario. API publish failures are sometimes caused by timeouts while waiting for sites to render JS; this would fix that.
http://support.import.io/knowledgebase/articles/623235-infinite-scroll-and-javascript-prerender-beta
I hope this helps.

Uncaught SyntaxError: Unexpected token <, when calling angularJS $http.jsonp

I'm trying to work with the IUCN Red List web services API (here's an example output). Unfortunately I haven't been able to find any documentation other than this one-off Gist. It looks as though the API is constructing an HTML document rather than returning a data object, which isn't something I've encountered before. I also notice that in the example there is no mention of a ?callback=JSON_CALLBACK parameter in the URL, which I would expect when dealing with JSONP.
I've constructed an HTTP request in AngularJS like so:
atRiskApp.controller('IucnController', ['$scope', '$routeParams', '$http', function ($scope, $routeParams, $http) {
  $scope.iucn = $routeParams.iucn; // pulling a number from the URL: ex. 22718591

  $scope.getIUCN = function () {
    var iucnUrl = 'http://api.iucnredlist.org/details/' + $scope.iucn + '/0.js';

    $http.jsonp(iucnUrl)
      .success(function (response) {
        console.log(response);
      })
      .error(function (response) {
        console.log(response);
      });
  };
}]);
Although the HTML document is being successfully passed to my app I'm getting the following error message:
Uncaught SyntaxError: Unexpected token <
It seems like the app is expecting JavaScript and is instead getting an HTML document, which it apparently can't parse. I've tried adding a config object to the request based on the Angular docs, $http.jsonp({url: iucnUrl, responseType: 'text'}), without any luck.
My question is, how do I work with the returned HTML document, or am I way off track here?
The response from the API is an HTML document served with a .js extension.
On the page you linked to in your comment, I found some potentially useful information under the heading API Index.
You can actually get JSON for all levels of taxonomy, including your example Aneides aeneus. However, this JSON doesn't include all of the data from the HTML version, so it's not as useful. Hopefully this helps a little.
API Index (excerpt)
It is also possible to retrieve the row(s) of the index corresponding to an individual species:
http://iucn-redlist-api.heroku.com/index/species/panthera-leo.json
You can use dashes for spaces, as a convenient replacement for the standard URL escape, %20.
The HTML format contains direct links to the species account pages. The CSV and JSON formats include a species_id column which can be used to construct species account URLs as follows:
http://iucn-redlist-api.heroku.com/details/species_id/0
To use the index JSON in Web pages directly, you may need JSONP padding; use the “.js” extension and add a “callback” parameter with the name of the function to use.
http://iucn-redlist-api.heroku.com/index/genus/Dioscorea.js?callback=show
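For example, a quick sketch of fetching that JSON index and building the species-account URLs (I'm assuming the index rows are JSON objects carrying the species_id column mentioned above):
import requests

# Species names use dashes for spaces, per the excerpt above
resp = requests.get("http://iucn-redlist-api.heroku.com/index/species/aneides-aeneus.json")
resp.raise_for_status()

for row in resp.json():
    # Assumption: each row has a species_id key, per the excerpt's CSV/JSON columns
    print("http://iucn-redlist-api.heroku.com/details/%s/0" % row["species_id"])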
I skimmed the website and its sitemap and found no reference to a public API. All the output is HTML, and it makes sense that the JSONP mechanism cannot make sense of it: the first < it encounters, it will fail (as is apparent).
First of all, I would contact the site admin to simply ask if there is an API that will yield you XML or json or some other object notation that's convenient to work with.
Then there's the scenario where his or her answer would be 'no':
Parsing HTML is not something to be taken lightly, and certainly not something you should write yourself unless absolutely necessary.
Luckily, there are ways to get data out of HTML: jQuery.parseHTML(), pure ('vanilla') JavaScript approaches you can use from within AngularJS, and full-blown HTML parsing libraries such as the HTML Agility Pack (for use in .NET), all of which can get you to the heart of the data within the DOM nodes you're trying to poke at.
There are many other libraries that might serve you better, but these examples will give you a good starting point for canvassing the landscape of HTML parsing. This will take some looking into, but it will be more than worth it.
