I have a raw xml file on Gist:
https://gist.githubusercontent.com/EmDubeu/196d95b561fa83a4ef360654ed919fe5/raw/9e2dde8d08a2ea4e45871bf8c55693334f8a69e1/NEIPA.xml
I store the above url in a cell in my Google spreadsheet (Settings!E27).
I'm trying to use importxml from my google sheet with the following formula:
=IMPORTXML(Settings!E27, "//HOP/NAME"), but it returns "Error Imported Xml content can not be parsed."
My formula works with this url:
http://www.beerxml.com/recipes.xml
Why is it not working with my Gist hosted xml file?
GitHub is not meant for file hosting, and the Content-Type header is not set properly. If you open http://www.beerxml.com/recipes.xml in the browser, it renders the page as XML content, but not your https://gist.githubusercontent.com/EmDubeu/... URL, since the browser cannot recognize it as an XML page.
In this case, people (at least, I) usually use sites like https://rawgit.com/. For your gist file, the RawGit URL is https://rawgit.com/EmDubeu/196d95b561fa83a4ef360654ed919fe5/raw/fcb019a0db249ea90a9512f9162725547f4a43b5/NEIPA.xml.
But when I open this URL, my browser says it cannot parse the page because of characters like &, which should be XML character encoded. You can verify this by viewing the source of http://www.beerxml.com/recipes.xml, in which & is properly encoded as &amp;. You should encode the entities in your gist too.
Also insert a line break between <?xml version="1.0" encoding="ISO-8859-1"?> and <RECIPES>.
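For illustration only, the start of the fixed file would look something like this (the recipe content below is made up; the points are the line break after the XML declaration and the &amp; encoding):

<?xml version="1.0" encoding="ISO-8859-1"?>
<RECIPES>
  <RECIPE>
    <HOPS>
      <HOP>
        <NAME>Citra &amp; Mosaic</NAME>
      </HOP>
    </HOPS>
  </RECIPE>
</RECIPES>

Once the & characters are encoded, IMPORTXML has a well-formed document to parse.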
When getting tweet information using the twitter API, the returned text or full_text field has a URL appended at the end of the text. For example:
"full_text": "Just another Extended Tweet with more than 140 characters, generated as a documentation example, showing that [\"truncated\": true] and the presence of an \"extended_tweet\" object with complete text and \"entities\" #documentation #parsingJSON #GeoTagged https://twitter.com/FloodSocial/status/994633657141813248"
https://twitter.com/FloodSocial/status/994633657141813248 is appended at the end. (The appended URL is actually a shortened URL, but Stack Overflow does not allow shortened URLs in the body, so I just replaced it with the full URL.) Why does the API add this, and is there a way to get the text without the added URL?
Are you using the correct twitter gem? Using gem install twitter and setting up a client according to the docs, you should be able to get the tweet/status by its ID. But the example you are using doesn't show how you got the full text.
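For reference, client setup per the gem docs looks roughly like this (the keys are placeholder values):

require 'twitter'

client = Twitter::REST::Client.new do |config|
  config.consumer_key        = "YOUR_CONSUMER_KEY"
  config.consumer_secret     = "YOUR_CONSUMER_SECRET"
  config.access_token        = "YOUR_ACCESS_TOKEN"
  config.access_token_secret = "YOUR_ACCESS_SECRET"
end

With a client configured like that, the status lookup below works: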
text = client.status('994633657141813248').text
=>"Just another Extended Tweet with more than 140 characters, generated as a documentation example, showing that https://twitter.com/FloodSocial/status/994633657141813248"
The URL is truncated as a plain string, so I'm not sure what you did to get the string you formulated.
But if you somehow have a long string with the URL embedded, you could do
text.split(/\shttps?/).first
That looks like a quote Tweet where the original Tweet URL is included?
[edit - I was wrong with the above statement]
I see what is happening. The original Tweet links to an image on Twitter (https://twitter.com/FloodSocial/status/994633657141813248/photo/1, via a shortened tco link). Twitter hides the image URL in the rendered Tweet, but returns it in the body of the text. That's the expected behaviour in this case. You can also see the link parsed out in the extended_entities segment of the Tweet data, as well as the image data itself in the same area of the Tweet. If you want to omit the URL from the text data, you'll need to trim it yourself.
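For example, a minimal Ruby sketch (assuming, as in the response above, that the unwanted link is a single https URL tacked onto the end of the text):

# text holds the full_text returned by the API.
# Strip one trailing URL plus the whitespace before it.
trimmed = text.sub(/\s+https?:\/\/\S+\z/, '')

A more robust approach is to use the indices in the Tweet's entities/extended_entities to slice the string, since those say exactly where the appended media URL starts.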
I have a column of links in Google Sheets. I want to tell, using IMPORTXML, whether a page is producing an error message.
As an example, this works fine
=importxml("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T", "//td/b")
i.e. it looks for td, and pulls out b (which are postcodes in Canada)
But this code that looks for the error message does not work:
=importxml("https://www.awwwards.com/error1/", "//div/h1" )
I want it to pull out the "THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST."
...on this page https://www.awwwards.com/error1/
I'm getting a Resource at URL not found error. What could I be doing wrong? Thanks
After quick trial and error with the default formulas:
=IMPORTXML("https://www.awwwards.com/error1/", "//*")
=IMPORTHTML("https://www.awwwards.com/error1/", "table", 1)
=IMPORTHTML("https://www.awwwards.com/error1/", "list", 1)
=IMPORTDATA("https://www.awwwards.com/error1/")
it seems that the website cannot be scraped in Google Sheets by any of the regular formulas.
You want to retrieve the value of THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST. from the URL of https://www.awwwards.com/error1/.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Issue and workaround:
I think that the page at your URL is an Error 404 (Not Found) page, so the status code 404 is returned. Because of this, built-in functions like IMPORTXML might not be able to retrieve the HTML data.
So as one workaround, how about using a custom function with UrlFetchApp? When UrlFetchApp is used, the HTML data can be retrieved even when the status code is 404.
Sample script for custom function:
Please copy and paste the following script into the script editor of the Spreadsheet, then put =SAMPLE("https://www.awwwards.com/error1") in a cell on the Spreadsheet. This runs the script.
function SAMPLE(url) {
  // muteHttpExceptions lets us read the response body even when the status code is 404.
  return UrlFetchApp
    .fetch(url, {muteHttpExceptions: true})
    .getContentText()
    // Pull out the text of the <h1> element from the raw HTML.
    .match(/<h1>([\w\s\S]+)<\/h1>/)[1]
    .toUpperCase();
}
Note:
This custom function is for the URL https://www.awwwards.com/error1. When you use it for other URLs, the expected results might not be retrieved. Please be careful about this.
References:
Custom Functions in Google Sheets
fetch(url, params)
muteHttpExceptions: If true the fetch doesn't throw an exception if the response code indicates failure, and instead returns the HTTPResponse. The default is false.
match()
toUpperCase()
If this was not the direction you want, I apologize.
I am trying to use IMPORTXML in a Google spreadsheet and got an N/A result. Error message:
Import XML content can't be parsed
URL: http://www.tripadvisor.com/Hotel_Review-g293916-d309884-Reviews-Indra_Regent_Hotel-Bangkok.html
This is what I have:
=IMPORTXML(url, "//img[@class='sprite-rating_rr_fill rating_rr_fill rr35']/@content")
That is what I want to grab:
the content attribute value of img
I am looking forward to your advice. I am not sure what I am doing wrong.
It's not your XPath that is wrong; rather, it is the source that is not a proper XML document (the img tag is not closed).
Indeed, if you try to run:
=IMPORTXML( url, "//div[@class='rs rating']" )
it resolves to:
1,087 Reviews.
But any descendant of it will throw an error.
You could try passing the HTML source through a 'sanitizer' first; then it should work.
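Alternatively, here is a rough sketch of an Apps Script custom function that fetches the raw HTML itself and extracts the attribute with a regular expression (the rating_rr_fill class name is taken from your XPath, the attribute order inside the tag is an assumption, and TripAdvisor may change its markup at any time):

function RATING(url) {
  // Fetch the page; a custom function is not bound by IMPORTXML's strict XML parsing.
  var html = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText();
  // Grab the content attribute of the first <img> whose class contains "rating_rr_fill".
  var match = html.match(/<img[^>]*class="[^"]*rating_rr_fill[^"]*"[^>]*content="([^"]*)"/);
  return match ? match[1] : 'not found';
}

You would then call =RATING(url) from the sheet instead of the IMPORTXML formula.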
I am developing an autocompletion/suggestion box using AJAX and servlets. My problem is how to parse the XML response in JavaScript to show it in a div tag. My XML response contains one parent tag, RESULTS, which contains a number of child tags called RESULT.
How do I get the RESULT values into JavaScript variables?
You could do this using jQuery.parseXML.
Take a look at this thread: Parse XML from XMLHttpRequest.
I think it contains the information you need.
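For example, a minimal sketch with jQuery, assuming the servlet returns <RESULTS> with nested <RESULT> tags as described in the question (the sample values and the suggestions element ID are made up):

// xmlString is the responseText from the AJAX call to the servlet.
var xmlString = '<RESULTS><RESULT>apple</RESULT><RESULT>apricot</RESULT></RESULTS>';
var xmlDoc = $.parseXML(xmlString);

// Collect the text of every <RESULT> element into a plain array.
var results = [];
$(xmlDoc).find('RESULT').each(function () {
  results.push($(this).text());
});

// Render the suggestions into the div.
$('#suggestions').html(results.join('<br>'));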
I have a currency converter application for iPhone which uses a web service. The web service returns the result in the following format:
<Date>4/5/2010</Date>
<Time>7:18:09 AM</Time>
<Amount>20</Amount>
<ExchangeRate>44.7336419443466</ExchangeRate>
<Result>894.672839</Result>
I'm storing the whole XML response in an NSString variable called theXML. I want to show the value inside the Result tag. How can I read the data from the XML file or from the string?
Thanks in advance..
Use XPath.
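If you would rather not pull in libxml2 (or a wrapper) for XPath, here is a rough alternative sketch using the built-in NSXMLParser instead; only the Result element name and the theXML variable come from the question, the rest is illustrative:

// Illustrative delegate that collects the text of the <Result> element.
@interface ResultParser : NSObject <NSXMLParserDelegate>
@property (nonatomic, copy) NSString *currentElement;
@property (nonatomic, copy) NSString *result;
@end

@implementation ResultParser
- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)ns qualifiedName:(NSString *)qName attributes:(NSDictionary *)attrs {
    self.currentElement = elementName;
}
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    // foundCharacters can arrive in chunks; append instead of assigning in production code.
    if ([self.currentElement isEqualToString:@"Result"]) {
        self.result = string; // "894.672839" for the example response above
    }
}
@end

// Usage, where theXML is the NSString holding the web service response:
NSXMLParser *parser = [[NSXMLParser alloc]
    initWithData:[theXML dataUsingEncoding:NSUTF8StringEncoding]];
ResultParser *resultParser = [[ResultParser alloc] init];
parser.delegate = resultParser;
[parser parse];
NSLog(@"Result = %@", resultParser.result);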