Importxml - could not fetch url - xpath

I am trying to get data from this website, http://www.npo.gov.za/PublicNpo/Npo, into Google Sheets.
I am using the IMPORTXML function:
=IMPORTXML("www.npo.gov.za/PublicNpo/Npo", "//table//tbody//tr//td[1]|//table//tbody//tr//td[20]")
I get the error: could not fetch url www.npo.gov.za/PublicNpo/Npo

How about adding http:// to the URL as follows?
From :
=IMPORTXML("www.npo.gov.za/PublicNpo/Npo", "//table//tbody//tr//td[1]|//table//tbody//tr//td[20]")
To :
=IMPORTXML("http://www.npo.gov.za/PublicNpo/Npo", "//table//tbody//tr//td[1]|//table//tbody//tr//td[20]")
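To illustrate why the original string fails, here is a minimal Go sketch (not part of Sheets, and the helper name withScheme is made up for this example). It assumes the only problem is the missing protocol: a URL parser sees no scheme and no host in "www.npo.gov.za/PublicNpo/Npo", so the whole string is treated as a relative path.

```go
package main

import (
	"fmt"
	"net/url"
)

// withScheme is a hypothetical helper: it prepends "http://" when the
// input has no scheme, and returns the input unchanged otherwise.
func withScheme(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme != "" {
		return raw
	}
	return "http://" + raw
}

func main() {
	// Without a scheme the parser reports an empty Scheme and Host.
	u, _ := url.Parse("www.npo.gov.za/PublicNpo/Npo")
	fmt.Printf("scheme=%q host=%q\n", u.Scheme, u.Host)

	fmt.Println(withScheme("www.npo.gov.za/PublicNpo/Npo"))
}
```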

Related

Filtering on golang API using chi router

I'm using Go and the chi router, trying to create an endpoint that filters a table by status, but when I use the ? in the URL pattern I receive 404 page not found.
I am using the following code:
r.Get("/table?status={status}", l.Hanlder(tableHandler.GetByStatus))
When I remove the ? it works fine. Can't I use it to filter? If not, how can I do it?
Thanks in advance.
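chi matches routes on the URL path only, so the query string must not be part of the route pattern. Register the route as /table and read the value inside the handler with r.URL.Query().Get("status"). The URLParam and URLParamFromCtx functions return path parameters, such as {status} in a pattern like /table/{status}, not query string values.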
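A minimal sketch of reading the filter from the query string follows. To stay self-contained it uses the standard library's http.ServeMux instead of chi, and getByStatus is a hypothetical stand-in for tableHandler.GetByStatus; the handler body would be the same under chi, with the route registered as r.Get("/table", getByStatus).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// getByStatus is a hypothetical handler. It reads the filter from the
// query string rather than from the route pattern.
func getByStatus(w http.ResponseWriter, r *http.Request) {
	status := r.URL.Query().Get("status") // "" when ?status= is absent
	if status == "" {
		http.Error(w, "missing status query parameter", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "filtering table by status=%s", status)
}

// request sends one GET to a mux that routes on the path only and
// returns the status code and response body.
func request(target string) (int, string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/table", getByStatus) // no "?status=..." in the pattern
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := request("/table?status=active")
	fmt.Println(code, body)
}
```

Whether a missing status should be an error or should mean "no filter" is a design choice; the 400 response here is only one option.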

How to get a correct xpath for use in Google Spreadsheets

[The image in the link directs to the line where my target information is placed]
I'm trying to get a correct XPath and use it with the IMPORTXML function.
What I need is "22.09.20".
This is what I get when I copy the XPath of the object on the website:
/html/body/div[2]/div/div[1]/div[2]/div[2]/div[2]/div[2]/div[2]/span[2]
The IMPORTXML function returns #N/A.

Getting an error trying to pull out text using Google Sheets and importxml()

I have a column of links in Google Sheets. I want to tell if a page is producing an error message using importxml
As an example, this works fine
=importxml("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T", "//td/b")
i.e. it looks for td, and pulls out b (which are postcodes in Canada)
But this code that looks for the error message does not work:
=importxml("https://www.awwwards.com/error1/", "//div/h1" )
I want it to pull out the "THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST."
...on this page https://www.awwwards.com/error1/
I'm getting a Resource at URL not found error. What could I be doing wrong? Thanks
After some quick trial and error with the built-in formulas:
=IMPORTXML("https://www.awwwards.com/error1/", "//*")
=IMPORTHTML("https://www.awwwards.com/error1/", "table", 1)
=IMPORTHTML("https://www.awwwards.com/error1/", "list", 1)
=IMPORTDATA("https://www.awwwards.com/error1/")
it seems that this page cannot be scraped in Google Sheets with any of the built-in formulas.
You want to retrieve the text THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST. from the URL https://www.awwwards.com/error1/.
If my understanding is correct, how about this answer? Please think of it as just one of several possible answers.
Issue and workaround:
The page at your URL is an Error 404 (Not Found) page, so the response carries the status code 404. Because of this, I think the built-in functions like IMPORTXML cannot retrieve the HTML data.
As one workaround, how about using a custom function with UrlFetchApp? With UrlFetchApp, the HTML data can be retrieved even when the status code is 404.
Sample script for custom function:
Please copy and paste the following script into the script editor of the Spreadsheet, then put =SAMPLE("https://www.awwwards.com/error1") in a cell of the Spreadsheet. The script runs when the cell is evaluated.
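function SAMPLE(url) {
  // muteHttpExceptions makes fetch() return the response even for a 404 status.
  var html = UrlFetchApp
    .fetch(url, {muteHttpExceptions: true})
    .getContentText();
  // Non-greedy match of the first <h1>...</h1> pair (an <h1> without attributes).
  var found = html.match(/<h1>([\s\S]+?)<\/h1>/);
  // Return an empty string instead of throwing when the page has no such <h1>.
  return found ? found[1].toUpperCase() : "";
}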
Note:
This custom function is written for the URL https://www.awwwards.com/error1. If you use it for other URLs, it might not retrieve the expected results. Please be careful about this.
References:
Custom Functions in Google Sheets
fetch(url, params)
muteHttpExceptions: If true the fetch doesn't throw an exception if the response code indicates failure, and instead returns the HTTPResponse. The default is false.
match()
toUpperCase()
If this was not the direction you want, I apologize.
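The idea behind the workaround, that a 404 response still carries an HTML body which can be read when the client does not treat the status as a failure, can be sketched outside Sheets as well. The Go example below is only an illustration under stated assumptions: it serves a made-up 404 page from a local test server (so it needs no network access), and headingOf and demo are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
)

// firstH1 matches the first <h1>...</h1> pair without attributes,
// non-greedy, and lets "." span newlines.
var firstH1 = regexp.MustCompile(`(?s)<h1>(.+?)</h1>`)

// headingOf fetches url and returns the status code together with the
// upper-cased text of the first <h1>, or "" if there is none.
func headingOf(url string) (int, string, error) {
	resp, err := http.Get(url) // a 404 is a normal response here, not an error
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	m := firstH1.FindSubmatch(body)
	if m == nil {
		return resp.StatusCode, "", nil
	}
	return resp.StatusCode, strings.ToUpper(strings.TrimSpace(string(m[1]))), nil
}

// demo serves a made-up 404 page locally and runs headingOf against it.
func demo() (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
		io.WriteString(w, "<html><body><h1>The page you were looking for doesn't exist.</h1></body></html>")
	}))
	defer srv.Close()
	return headingOf(srv.URL)
}

func main() {
	code, heading, err := demo()
	fmt.Println(code, heading, err)
}
```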

How to figure out proper xpath for importxml in google sheets?

I am trying to import and use XML data from
http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_std.cgi?ico=00064581
in Google Sheets with the IMPORTXML function.
I want to get a subject name, ID, and address (etc.), but I cannot figure out the proper XPath and keep receiving "Imported content is empty". I can use //*, but this is not exactly what I need. Could you give me some hints, please?
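The elements in this XML carry namespace prefixes (are:, dtt:), and a plain element name in an XPath does not match namespaced elements. You can compare the qualified name with name() instead, for example with this formula in a cell: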
=importxml("http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_std.cgi?ico=00064581", "//*[name()='are:Shoda_ICO']")
In case you want to get multiple elements :
=importxml("http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_std.cgi?ico=00064581", "//*[name()='dtt:Nazev_obce' or name()='are:Shoda_ICO']")

Assign a variable to xpath scrapy

I'm using Scrapy to crawl a web page. The page has 10+ links to crawl using LinkExtractor, and everything works fine, but when crawling the extracted links I need to get the page URL. I have no other way to get the URL but to use
response.request.url
How do I assign that value to
il.add_xpath('url', response.request.url)
If I do it like this, I get this error:
  File "C:\Python27\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
exceptions.ValueError: Invalid XPath: http://www.someurl.com/news/45539/title-of-the-news
And for the description it is like this (just for reference):
il.add_xpath('descrip', './/div[@class="main_text"]/p/text()')
Thanks
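The loader has more than one way of adding values to the item, including add_xpath and add_value. add_xpath expects an XPath expression as its second argument, which is why passing a URL raises Invalid XPath. For a literal value, use add_value, something like: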
...
il.add_value('url', response.url) # yes, response also has the url attribute
