XPath for IMPORTXML in Google Sheets

I am trying to use IMPORTXML in a Google spreadsheet and I get an #N/A result with this error message:
Import XML content can't be parsed
URL: http://www.tripadvisor.com/Hotel_Review-g293916-d309884-Reviews-Indra_Regent_Hotel-Bangkok.html
This is what I have:
=IMPORTXML(url, "//img[@class='sprite-rating_rr_fill rating_rr_fill rr35']/@content")
That is what I want to grab:
the content attribute value of img
I am looking forward to your advice. I am not sure what I am doing wrong.

It's not your XPath that is wrong; rather, the source is not a proper XML document (the img tag is never closed).
Indeed, if you try to run:
=IMPORTXML(url, "//div[@class='rs rating']")
it resolves to:
1,087 Reviews.
But any descendant of it will throw an error.
You could try passing the HTML source through a 'sanitizer' first; then it should work.
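Outside of Sheets, the 'sanitizer' idea can be sketched with Python's stdlib html.parser, which tolerates unclosed tags. The class names match the formula above, but the sample fragment is modeled on the markup described in the question, not copied from the live page:

```python
# Minimal sketch: extract the "content" attribute of a rating <img>
# from non-well-formed HTML using Python's lenient stdlib parser.
from html.parser import HTMLParser

class RatingExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Unclosed <img> tags are fine here; no well-formed XML is required.
        if tag == "img" and "rating_rr_fill" in attrs.get("class", ""):
            self.ratings.append(attrs.get("content"))

# Sample fragment modeled on the TripAdvisor markup described above.
html = ('<div class="rs rating">'
        '<img class="sprite-rating_rr_fill rating_rr_fill rr35" content="3.5">'
        '1,087 Reviews</div>')
p = RatingExtractor()
p.feed(html)
print(p.ratings)  # ['3.5']
```

A real sanitizer service would apply the same trick server-side and hand Sheets back well-formed XML.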

Related

Getting an error trying to pull out text using Google Sheets and importxml()

I have a column of links in Google Sheets. I want to tell whether a page is producing an error message, using importxml.
As an example, this works fine
=importxml("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T", "//td/b")
i.e. it looks for td elements and pulls out the b elements (which are Canadian postal codes)
But this code that looks for the error message does not work:
=importxml("https://www.awwwards.com/error1/", "//div/h1" )
I want it to pull out the "THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST."
...on this page https://www.awwwards.com/error1/
I'm getting a Resource at URL not found error. What could I be doing wrong? Thanks
After quick trial and error with the default formulas:
=IMPORTXML("https://www.awwwards.com/error1/", "//*")
=IMPORTHTML("https://www.awwwards.com/error1/", "table", 1)
=IMPORTHTML("https://www.awwwards.com/error1/", "list", 1)
=IMPORTDATA("https://www.awwwards.com/error1/")
it seems that the website cannot be scraped in Google Sheets by any means (regular formulas).
You want to retrieve the value of THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST. from the URL of https://www.awwwards.com/error1/.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Issue and workaround:
I think that the page at your URL is Error 404 (Not Found), so the status code of 404 is returned. I thought that because of this, built-in functions like IMPORTXML might not be able to retrieve the HTML data.
So, as one workaround, how about using a custom function with UrlFetchApp? When UrlFetchApp is used, the HTML data can be retrieved even when the status code is 404.
Sample script for custom function:
Please copy and paste the following script into the script editor of the Spreadsheet, then put =SAMPLE("https://www.awwwards.com/error1") in a cell on the Spreadsheet. This runs the script.
function SAMPLE(url) {
  // muteHttpExceptions makes fetch() return the body even for a 404 response.
  return UrlFetchApp
    .fetch(url, {muteHttpExceptions: true})
    .getContentText()
    .match(/<h1>([\w\s\S]+)<\/h1>/)[1] // text of the first <h1> element
    .toUpperCase();
}
Note:
This custom function is written for the URL https://www.awwwards.com/error1. When you use it for other URLs, the expected results might not be retrieved. Please be careful about this.
References:
Custom Functions in Google Sheets
fetch(url, params)
muteHttpExceptions: If true the fetch doesn't throw an exception if the response code indicates failure, and instead returns the HTTPResponse. The default is false.
match()
toUpperCase()
If this was not the direction you want, I apologize.

How to figure out proper xpath for importxml in google sheets?

I am trying to import and use XML data from
http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_std.cgi?ico=00064581
to Google Sheets with function IMPORTXML,
trying to get the subject name, ID, address, etc., but I cannot figure out the proper XPath to get them; I receive "Imported content is empty." I can do //*, but this is not exactly what I need. Could you give me some hints, please?
For example you can use something like this as formula of a cell:
=importxml("http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_std.cgi?ico=00064581", "//*[name()='are:Shoda_ICO']")
In case you want to get multiple elements:
=importxml("http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_std.cgi?ico=00064581", "//*[name()='dtt:Nazev_obce' or name()='are:Shoda_ICO']")
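The name() trick matches elements by local name regardless of the namespace prefix, which is why it works where a plain //are:Shoda_ICO would not. A small stdlib sketch of the same idea; note the namespace URI below is a made-up placeholder, not the real ARES one:

```python
# Match elements by local name while ignoring the namespace, like
# //*[name()='are:Shoda_ICO'] does in the formula above.
# NOTE: "urn:example:are" is a placeholder URI for illustration only.
import xml.etree.ElementTree as ET

doc = '<root xmlns:are="urn:example:are"><are:Shoda_ICO>1</are:Shoda_ICO></root>'
tree = ET.fromstring(doc)
# ElementTree stores tags as "{namespace-uri}LocalName".
hits = [el for el in tree.iter() if el.tag.rpartition("}")[2] == "Shoda_ICO"]
print(hits[0].text)  # 1
```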

Web Scraping Return Empty Value Using Xpath in Scrapy

Really need the help from this community.
My question is that when I used the code
response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()
to extract the vendor name in the scrapy shell, the output was empty. I really don't know why that happened; it seems to me that the problem might be that the website info is updated dynamically?
The url for this web scraping is: https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the Vendor name and Price for each vendor. Besides the attached pic is the screenshot of "the inspect".
Really appreciate the help!
You always need to check the HTML source code in your browser (usually with Ctrl+U).
This way you'll find that the information you want is embedded inside JavaScript variables as JSON:
var partnerPrices = [{"pool":"9a316391b6550eef969c8559c14a380f","partner":"ncl.com","priority":0,"currency":"USD","data":{"32860":{"2018-02-25":{"Inside":579,"Suite":1199,"Balcony":699,"Oceanview":629},....
var sponsored_partners = [{"code":"CDCNA","name":"cruises.com","value":"cruises.com","logo":"\/images\/partner-logo-cruises-sm.png","logo_sprite":"partner-logo-cruises-com"},...
So you need to import json, extract those JSON strings from response.body (using re or another method), and then json.loads() them so you can iterate through the two arrays.
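A stdlib-only sketch of that approach; the page body below is a shortened stand-in built from the snippets quoted above, not the live response:

```python
# Extract the JSON array assigned to a JS variable with a regex,
# then parse it with json.loads. Sample body is a shortened stand-in.
import json
import re

body = '''
var partnerPrices = [{"partner":"ncl.com","currency":"USD",
  "data":{"32860":{"2018-02-25":{"Inside":579,"Balcony":699}}}}];
var sponsored_partners = [{"code":"CDCNA","name":"cruises.com"}];
'''

def extract_var(name, text):
    # Capture the bracketed array assigned to the given JS variable.
    m = re.search(r'var\s+%s\s*=\s*(\[.*?\]);' % re.escape(name), text, re.S)
    return json.loads(m.group(1)) if m else None

prices = extract_var("partnerPrices", body)
partners = extract_var("sponsored_partners", body)
print(prices[0]["partner"], partners[0]["name"])  # ncl.com cruises.com
```

In a spider you would run extract_var against response.text and iterate over the two resulting lists.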

Extracting a link with jmeter

So I need to trigger a delete action via an "onclick" dynamic link using JMeter.
Here is the sample of one of the links:
"Delete"
What I need is to extract the number and post it in order to perform the delete action. Every link is the same except for the number.
I have tried to implement some of the solutions I've found on this site but it didn't work.
Thanks in advance
Peace
If you need to do it with XPath, you could try the substring-after function, like:
substring-after(//a[text()='Delete']/@href,'param=')
The above expression returns everything after the param= text in the href attribute of an HTML <a> tag having the text Delete.
You can test your XPath expressions against actual server response using XPath Tester tab of the View Results Tree listener.
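As a quick local check of what substring-after does (it returns everything after the first occurrence of the separator), here is the same string logic in Python; the href value is hypothetical, since the actual link markup was elided from the question:

```python
# substring-after(x, 'param=') returns the part of x after the first
# occurrence of 'param='; str.partition mirrors that behavior.
# The href below is a hypothetical example, not taken from the real page.
href = "/items/delete?param=12345"
number = href.partition("param=")[2]
print(number)  # 12345
```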
References:
substring-after Function Reference
XPath 1.0 Language Reference
Using the XPath Extractor in JMeter
XPath Tutorial

Assign a variable to xpath scrapy

I'm using Scrapy to crawl a webpage. The page has 10+ links to crawl using LinkExtractor; everything works fine, but while crawling the extracted links I need to get the page URL. I have no way to get the URL other than
response.request.url
How do i assign that value to
il.add_xpath('url', response.request.url)
If i do it like this i get error:
File "C:\Python27\lib\site-packages\scrapy\selector\unified.py", line 100, in xpath
    raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
exceptions.ValueError: Invalid XPath: http://www.someurl.com/news/45539/title-of-the-news
And for the description it is like this (just for reference):
il.add_xpath('descrip', './/div[@class="main_text"]/p/text()')
Thanks
The loader comes with two ways of adding attributes to the item: add_xpath and add_value. Since response.request.url is already a value, not an XPath expression, you should use something like:
...
il.add_value('url', response.url) # yes, response also has the url attribute
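The distinction can be illustrated with a minimal stand-in for the loader (this is a sketch of the contract, not scrapy's real ItemLoader): add_xpath receives an XPath expression to be evaluated against the response, while add_value stores a value you already have, which is why passing a URL to add_xpath raises Invalid XPath.

```python
# Minimal stand-in illustrating the add_value vs add_xpath contract.
# This mimics (and is NOT) scrapy's ItemLoader; purely for illustration.
class MiniLoader:
    def __init__(self):
        self.item = {}

    def add_value(self, field, value):
        # The value is stored as-is; nothing is parsed as XPath.
        self.item[field] = value

    def add_xpath(self, field, xpath):
        # A real loader would evaluate this against the response; a URL
        # like "http://..." is not a valid XPath expression and fails.
        if xpath.startswith("http"):
            raise ValueError("Invalid XPath: %s" % xpath)
        self.item[field] = "<selected by %s>" % xpath

il = MiniLoader()
il.add_value("url", "http://www.someurl.com/news/45539/title-of-the-news")
print(il.item["url"])
```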
