Web Scraping Returns Empty Value Using XPath in Scrapy

Really need the help from this community.
My question is that when I use the code
response.xpath("//div[contains(@class,'check-prices-widget-not-sponsored')]/a/div[contains(@class,'check-prices-widget-not-sponsored-link')]").extract()
to extract the vendor name in the Scrapy shell, the output is empty. I really don't know why this happens; it seems to me that the problem might be that the website's info is loaded dynamically.
The URL for this web scraping is https://cruiseline.com/cruise/7-night-bahamas-florida-new-york-roundtrip-32860, and what I need is the vendor name and price for each vendor.
Really appreciate the help!

You always need to check the HTML source code in your browser (usually with Ctrl+U).
This way you'll find that the information you want is embedded inside JavaScript variables as JSON:
var partnerPrices = [{"pool":"9a316391b6550eef969c8559c14a380f","partner":"ncl.com","priority":0,"currency":"USD","data":{"32860":{"2018-02-25":{"Inside":579,"Suite":1199,"Balcony":699,"Oceanview":629},....
var sponsored_partners = [{"code":"CDCNA","name":"cruises.com","value":"cruises.com","logo":"\/images\/partner-logo-cruises-sm.png","logo_sprite":"partner-logo-cruises-com"},...
So you need to import json, extract those JSON strings from response.body (using re or another method), and then json.loads() them so you can iterate through the two arrays.
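A minimal sketch of that approach inside a Scrapy callback (the variable names come from the page source above; the exact regexes are an assumption and may need tuning for this page):

import json
import re

def parse(self, response):
    # Pull the two JavaScript array literals out of the decoded page
    # (response.text) with regexes keyed on the variable names shown above.
    prices_m = re.search(r'var partnerPrices\s*=\s*(\[.+?\]);', response.text, re.DOTALL)
    partners_m = re.search(r'var sponsored_partners\s*=\s*(\[.+?\]);', response.text, re.DOTALL)
    if not (prices_m and partners_m):
        return
    partner_prices = json.loads(prices_m.group(1))
    sponsored_partners = json.loads(partners_m.group(1))  # vendor codes, names, logos
    for partner in partner_prices:
        yield {
            'vendor': partner.get('partner'),    # e.g. "ncl.com"
            'currency': partner.get('currency'),
            'prices': partner.get('data'),       # nested per-date cabin prices
        }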

Related

Getting an error trying to pull out text using Google Sheets and importxml()

I have a column of links in Google Sheets. I want to tell, using importxml, whether a page is producing an error message.
As an example, this works fine
=importxml("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T", "//td/b")
i.e. it looks for td, and pulls out b (which are postcodes in Canada)
But this code that looks for the error message does not work:
=importxml("https://www.awwwards.com/error1/", "//div/h1" )
I want it to pull out the "THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST."
...on this page https://www.awwwards.com/error1/
I'm getting a "Resource at URL not found" error. What could I be doing wrong? Thanks.
After quick trial and error with the default formulae:
=IMPORTXML("https://www.awwwards.com/error1/", "//*")
=IMPORTHTML("https://www.awwwards.com/error1/", "table", 1)
=IMPORTHTML("https://www.awwwards.com/error1/", "list", 1)
=IMPORTDATA("https://www.awwwards.com/error1/")
it seems that the website can't be scraped in Google Sheets by any of the regular formulae.
You want to retrieve the value of THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST. from the URL of https://www.awwwards.com/error1/.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
Issue and workaround:
I think that the page at your URL is an Error 404 (Not Found), so the status code 404 is returned. Because of this, the built-in functions like IMPORTXML might not be able to retrieve the HTML data.
So as one workaround, how about using a custom function with UrlFetchApp? When UrlFetchApp is used, the HTML data can be retrieved even when the status code is 404.
Sample script for custom function:
Please copy and paste the following script to the script editor of the Spreadsheet. And please put =SAMPLE("https://www.awwwards.com/error1") to a cell on the Spreadsheet. By this, the script is run.
function SAMPLE(url) {
  return UrlFetchApp
    .fetch(url, {muteHttpExceptions: true})
    .getContentText()
    .match(/<h1>([\w\s\S]+)<\/h1>/)[1]
    .toUpperCase();
}
Note:
This custom function is for the URL https://www.awwwards.com/error1. When you use this for other URLs, the expected results might not be retrieved. Please be careful about this.
References:
Custom Functions in Google Sheets
fetch(url, params)
muteHttpExceptions: If true the fetch doesn't throw an exception if the response code indicates failure, and instead returns the HTTPResponse. The default is false.
match()
toUpperCase()
If this was not the direction you want, I apologize.

I can't get my importxml statement to work - XML import problem or XPath problem?

I am trying to import data from a publicly available XML source into my Google Sheet. The data is available as a direct link from an HTML representation - the XML file contains the same data.
Reference: https://cordis.europa.eu/project/rcn/214839/factsheet/en and https://cordis.europa.eu/project/rcn/214839_en.xml
I've been browsing various sources, and nothing has helped so far. It seems that Google can read the data, because the statements
=IMPORTXML("https://cordis.europa.eu/project/rcn/214839_en.xml", "*")
and
=IMPORTXML("https://cordis.europa.eu/project/rcn/214839_en.xml", "//*")
produce results.
However, simple XPath statements such as //rcn or /project/rcn/text() bark back at me with an "N/A" result.
What am I doing wrong here?
You'll want to use the local-name function: the XML document declares a default namespace, so unprefixed names like //rcn in your XPath don't match any elements.
=IMPORTXML("https://cordis.europa.eu/project/rcn/214839_en.xml","//*[local-name()='rcn']")

Xpath is correct but Scrapy spider doesn't work

I'm trying to download data from a webpage: I identified the XPath expression and then ran the spider, but nothing is downloaded.
The webpage: https://octopart.com/electronic-parts/integrated-circuits-ics
Here is the code:
for product in response.xpath("//div[@class='serp-card-header media']/div[@class='media-body']"):
    yield {'name': product.xpath("//a/span[@class='part-card-manufacturer']/text()").extract_first()}
This website seems to be using some simple bot detection. You are most likely using the default Scrapy user agent, so you need to set a real user agent in your settings.py instead:
USER_AGENT = '[replace with a real user agent]'
Refer to the documentation.
After doing this you will get some results. However, your XPath is incorrect as well. Inside the for loop, when you do a relative XPath, it needs to start with .//a/span.... See here for the reason why: https://docs.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
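Putting both fixes together (and assuming USER_AGENT is already set in settings.py), the loop would look something like this:

for product in response.xpath("//div[@class='serp-card-header media']/div[@class='media-body']"):
    # The leading dot makes the inner XPath relative to `product`
    # instead of searching the whole document again.
    yield {'name': product.xpath(".//a/span[@class='part-card-manufacturer']/text()").extract_first()}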

Extract value from javascript object in site using xpath and import.io

I want to extract a number provided by a JavaScript object on a site, but I really don't understand what I am doing.
I tried different versions based on similar examples and guidelines on the import.io site and other tutorial sites, but I got only one of two results: I extracted all the numbers on the given page, or nothing at all.
I tried e.g. //*[contains(.,"Unikālo apmeklējumu skaits:")]/@type and //*[contains(.,"Unikālo apmeklējumu skaits:")]. Most likely it's necessary to add something else there, but I just don't know what.
Link I am interested in to extract from is: https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html and information necessary is a number after text "Unikālo apmeklējumu skaits:" which is given by javascript.
Hopefully someone will be able to help me with this problem.
For someone who is new to web scraping this is a hard task; I'll try to explain it. First of all, the XPath to get to that location could be something like this:
'//td[#class="msg_footer" and contains(text(), "Unik")]'
Now you have that tag (and what it contains), but if you check, it doesn't contain the number you need; that content is dynamically loaded by a JavaScript, this one:
<script type="text/javascript"><!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
--></script>
which could be gotten from the response with this xpath:
'//script[contains(text(), "contacts_js")]/text()'
From that string you should take the URL that comes in src, for example this one:
/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
and append the current date to the end, since the JavaScript builds it with new Date(). Then you should make a request to that URL (prepending the previous response's domain), so something like:
https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
Note that the date is URL-encoded. It should return a response like:
var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;
pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");
where you can check that the value inside SHOW_CNT is the number you want.
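Here is roughly how the same two-request flow could look as a Scrapy spider instead of import.io (the spider name and the date formatting are assumptions; the XPath and the SHOW_CNT regex come from the steps above):

import re
from datetime import datetime
from urllib.parse import quote

import scrapy

class SsLvSpider(scrapy.Spider):
    name = 'ss_lv_visits'  # hypothetical name
    start_urls = ['https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html']

    def parse(self, response):
        script = response.xpath('//script[contains(text(), "contacts_js")]/text()').get()
        # Pull the src="/js/...?t=" path out of the document.write() call.
        src = re.search(r"src=\"([^\"']+)", script).group(1)
        # The page's JavaScript appends new Date(); approximate its string form.
        date_str = datetime.now().strftime('%a %b %d %Y %H:%M:%S')
        yield scrapy.Request(response.urljoin(src + quote(date_str)),
                             callback=self.parse_counter)

    def parse_counter(self, response):
        # The second response is JavaScript; grab SHOW_CNT with a regex.
        show_cnt = re.search(r'SHOW_CNT=(-?\d+)', response.text).group(1)
        yield {'unique_visits': int(show_cnt)}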
If you want to know how I figured out which request and which script were populating that tag: I used Firebug, searching for SHOW_CNT inside all of the responses involved in calling your URL, which pointed to the request I specified, and then I checked what was requesting it.
Hope it helped.
support@import.io are the guys to speak to; they give free advice and help troubleshoot problems just like this all the time.
There are all kinds of tips and tricks you can use... for example, import.io provides an (undocumented, beta) JavaScript pre-render service that would likely work for you in this scenario. API publish failures are sometimes caused by timeouts while waiting for sites to render JS; this would fix that.
http://support.import.io/knowledgebase/articles/623235-infinite-scroll-and-javascript-prerender-beta
I hope this helps.

Assign a variable to xpath scrapy

I'm using Scrapy to crawl a webpage. The page has 10+ links to crawl using LinkExtractor, and everything works fine, but when crawling the extracted links I need to get the page URL. I have no way to get the URL other than
response.request.url
How do I assign that value to the loader, as in
il.add_xpath('url', response.request.url)
If I do it like this, I get an error:
File "C:\Python27\lib\site-packages\scrapy\selector\unified.py", line
100, in xpath
raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
exceptions.ValueError: Invalid XPath: http://www.someurl.com/news/45539/
title-of-the-news
And for the description it is like this (just for reference):
il.add_xpath('descrip', './/div[@class="main_text"]/p/text()')
Thanks
The loader comes with two ways of adding attributes to the item: add_xpath, which takes an XPath expression to extract data from the response, and add_value, which takes a literal value. So you should use something like:
...
il.add_value('url', response.url) # yes, response also has the url attribute
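A fuller sketch of that inside a spider callback (NewsItem is a hypothetical item with the two fields from the question):

import scrapy
from scrapy.loader import ItemLoader

class NewsItem(scrapy.Item):
    # Hypothetical item; the field names mirror the loader calls in the question.
    url = scrapy.Field()
    descrip = scrapy.Field()

def parse_news(self, response):  # a spider callback
    il = ItemLoader(item=NewsItem(), response=response)
    il.add_value('url', response.url)                               # literal value
    il.add_xpath('descrip', './/div[@class="main_text"]/p/text()')  # extracted via XPath
    yield il.load_item()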
