Using request header and body information for an AJAX query with Scrapy

I am scraping the following website: https://www2.woolworthsonline.com.au
My first step is to get a list of product categories, which I can do.
Then I need to get the sub-categories. The sub-categories seem to be dynamically generated. I have set up what I thought was the right header and body information for the request, plus a small test callback to see whether the request is working correctly. My code never gets to the callback, so presumably there is something wrong with the request:
def parse(self, response):
    # Get links to the categories
    categories_links = response.xpath('//a[@class="navigation-link"]').re(r'href="(\S+)"')
    for link in categories_links:
        # Generate and yield a request for the sub-categories
        get_request_headers = dict()
        get_request_headers['Accept'] = 'application/xml, text/xml, */*; q=0.01'
        get_request_headers['Accept-Encoding'] = 'gzip, deflate, sdch'
        get_request_headers['Accept-Language'] = 'en-US,en;q=0.8'
        get_request_headers['Connection'] = 'keep-alive'
        get_request_body = urllib.urlencode(
            {'_mode': 'ajax',
             '_ajaxsource': 'navigation-panel',
             '_referrer': link,
             '_bannerViews': '6064',
             '_': '1429428492880'}
        )
        url_link = 'https://www2.woolworthsonline.com.au' + link
        yield Request(url=url_link, callback=self.subcategories, headers=get_request_headers,
                      method='POST', body=get_request_body, meta={'category_link': link})
    return

def subcategories(self, response):
    print "sub-categories test: ", response.url
    return

Try this out,
BASE_URL = 'https://www2.woolworthsonline.com.au'

def parse(self, response):
    categories = response.xpath(
        '//a[@class="navigation-link"]/@href').extract()
    for link in categories:
        yield Request(url=self.BASE_URL + link, callback=self.parse_subcategory)

def parse_subcategory(self, response):
    sub_categories = response.xpath(
        '//li[@class="navigation-node navigation-category selected"]/ul/li/span/a/@href').extract()
    for link in sub_categories:
        yield Request(url=self.BASE_URL + link, callback=self.parse_products)

def parse_products(self, response):
    # here you will get the list of products for the sub-category
    # extract the product details here
    pass
First, we extract the category links from the start URL in the parse function.
Then we issue a request for each extracted category URL (we have to prepend BASE_URL to the extracted link, since the hrefs are relative URLs, not absolute ones).
In the callback (here parse_subcategory) we get the response for each category URL.
We do the same thing again for the sub-categories.
Finally, we extract the product details; a fuller sketch of the whole spider follows below.
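Putting the pieces above together, a minimal self-contained spider might look like the following. The spider name and the product XPath in parse_products are hypothetical placeholders; the other selectors are taken from the snippets above, so treat this as a sketch rather than a tested implementation.

import scrapy
from scrapy import Request

class WoolworthsSpider(scrapy.Spider):
    name = 'woolworths'  # hypothetical spider name
    BASE_URL = 'https://www2.woolworthsonline.com.au'
    start_urls = [BASE_URL]

    def parse(self, response):
        # Category links are relative, so prepend BASE_URL
        for link in response.xpath('//a[@class="navigation-link"]/@href').extract():
            yield Request(self.BASE_URL + link, callback=self.parse_subcategory)

    def parse_subcategory(self, response):
        # Sub-category links live under the currently selected navigation node
        for link in response.xpath(
                '//li[@class="navigation-node navigation-category selected"]'
                '/ul/li/span/a/@href').extract():
            yield Request(self.BASE_URL + link, callback=self.parse_products)

    def parse_products(self, response):
        # Placeholder extraction; adjust the selector to the real product markup
        for product in response.xpath('//div[contains(@class, "product")]'):
            yield {'name': product.xpath('.//a/text()').extract_first()}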

Related

Scrapy Xpath returns empty output

I'm trying to scrape this data (the 71) from this line of HTML:
<span class="text-pill text-pill--steel tooltip tooltipstered" data-options="{"theme": "white"}">71</span>
from the website
https://www.attheraces.com//racecard/Hamilton/28-September-2020/1330
I tried
class ErgebnisseSpider(scrapy.Spider):
    name = 'namen'
    allowed_domains = ['www.attheraces.com/']
    start_urls = ['https://www.attheraces.com//racecard/Hamilton/28-September-2020/1330']

    def parse(self, response):
        starterkomplett = response.xpath('//div[@class="column width--tablet-wide-18"]')
        for rennen2 in starterkomplett:
            rating = rennen2.xpath('//span[@class="text-pill text-pill--steel tooltip tooltipstered"]').getall()
            yield {
                'rennen2_rating': rating,
            }
But I am not getting the "71" rating, neither with /text() nor without, although with Chrome's inspection tool I can select the span with the aforementioned expression. What am I missing here?
With other values on the page the code works, but not for the rating.
I am fairly new and still learning, and googling did not get me any further, so I figured I would ask here - again - sorry for being such a noob :P
Look at the page source: the tooltipstered class is added by JavaScript after the page loads, so it is not present in the HTML that Scrapy actually downloads. Match only the classes that appear in the source:
response.xpath("//*[@class='text-pill text-pill--steel tooltip']//text()").getall()
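Applied to the spider from the question, the parse method could then look like this (a sketch; the surrounding div selector is kept from the question, and the inner XPath is made relative with .// so it only matches spans inside that div):

def parse(self, response):
    for rennen2 in response.xpath('//div[@class="column width--tablet-wide-18"]'):
        # class list without "tooltipstered", which only exists after JavaScript runs
        rating = rennen2.xpath('.//span[@class="text-pill text-pill--steel tooltip"]/text()').get()
        yield {'rennen2_rating': rating}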

How do I enable following URL capability to work in my code?

I am attempting to add the follow-URL capability but can't seem to get it to work. I need to crawl all the pages; there are around 108 pages of job listings. Thank you.
import scrapy

class JobItem(scrapy.Item):
    # Data structure to store the title, company name and location of the job
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()

class PythonDocumentationSpider(scrapy.Spider):
    name = 'pydoc'
    start_urls = ['https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab']

    def parse(self, response):
        for follow_href in response.xpath('//h2[@class="fs-body2 job-details__spaced mb4"]/a/@href'):
            follow_url = response.urljoin(follow_href.extract())
            yield scrapy.Request(follow_url, callback=self.parse_page_title)

        for a_el in response.xpath('//div[@class="-job-summary"]'):
            section = JobItem()
            section['title'] = a_el.xpath('.//a[@class="s-link s-link__visited job-link"]/text()').extract()[0]
            span_texts = a_el.xpath('.//div[@class="fc-black-700 fs-body1 -company"]/span/text()').extract()
            section['company'] = span_texts[0]
            section['location'] = span_texts[1]
            print(section['location'])
            #print(type(section))
            yield section
I am attempting to get the URL-following to work with my code, and then to crawl the pages and store the job postings in a CSV file.
.extract() returns a list. In most cases you'll want .get() or .extract_first() instead if you don't need a list.
First you need to rewrite this part:
for follow_href in response.xpath('//h2[@class="fs-body2 job-details__spaced mb4"]/a/@href').getall():  # or .extract()
    follow_url = response.urljoin(follow_href)
    yield scrapy.Request(follow_url, callback=self.parse_page_title)
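To actually walk all ~108 listing pages you also need to follow the pagination links. A minimal sketch, assuming the next-page link is exposed as an anchor with rel="next" (verify that selector against the real page markup):

def parse(self, response):
    # ... item extraction as above ...

    # Follow the "next page" link, if any (selector is an assumption)
    next_href = response.xpath('//a[@rel="next"]/@href').get()
    if next_href:
        yield scrapy.Request(response.urljoin(next_href), callback=self.parse)

With the items being yielded, the CSV part can then be handled by Scrapy's feed export, e.g. scrapy crawl pydoc -o jobs.csv.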

Xpath selector not working in Scrapy

I'm trying to extract the text from this Xpath:
//*/li[contains(., "Full Name")]/span/text()
from this webpage:
http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs
I've tested it in Google Chrome's console (where it works), along with many other variations of the XPath, but I can't get it to work with Scrapy. My code only returns "{}".
Here's where I have been testing it in my code, for context:
def parse_bio(self, response):
    loader = response.meta['loader']
    fullnameValue = response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
    loader.add_value('fullName', fullnameValue)
    return loader.load_item()
The problem isn't my code (I don't think); it works fine with other (very broad) XPath selectors. But I'm not sure what's wrong with this XPath. I have JavaScript disabled, if that makes a difference.
Any help would be great!
Edit: Here is the rest of the code to make it more clear:
from scrapy import Spider, Request, Selector
from votesmart.items import LegislatorsItems, TheLoader

class VSSpider(Spider):
    name = "vs"
    allowed_domains = ["votesmart.org"]
    start_urls = ["https://votesmart.org/officials/WA/L/washington-state-legislative"]

    def parse(self, response):
        for href in response.xpath('//h5/a/@href').extract():
            person_url = response.urljoin(href)
            yield Request(person_url, callback=self.candidatesPoliticalSummary)

    def candidatesPoliticalSummary(self, response):
        item = LegislatorsItems()
        l = TheLoader(item=LegislatorsItems(), response=response)
        ...
        # populating items with the item loader; works fine

        # create the right bio url and pass the item loader to it
        bio_url = response.url.replace('votesmart.org/candidate/',
                                       'votesmart.org/candidate/biography/')
        return Request(bio_url, callback=self.parse_bio, meta={'loader': l})

    def parse_bio(self, response):
        loader = response.meta['loader']
        print response.request.url
        loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')
        return loader.load_item()
I figured out my problem! Many pages on the site were login protected, and I wasn't able to scrape pages that I couldn't access in the first place. Scrapy's FormRequest did the trick. Thanks for all the help (especially the suggestion of using view(response), which is super helpful).
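For reference, logging in before scraping usually looks something like the sketch below. The login URL, form field names, and the follow-up callback are hypothetical placeholders to adapt to the actual login form.

from scrapy import Spider, FormRequest, Request

class VSLoginSpider(Spider):
    name = "vs_login"  # hypothetical name
    start_urls = ["https://votesmart.org/login"]  # placeholder login URL

    def parse(self, response):
        # Fill and submit the login form found in the response
        return FormRequest.from_response(
            response,
            formdata={'username': 'my_user', 'password': 'my_pass'},  # placeholder field names
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy keeps the session cookie automatically, so protected pages
        # can now be requested as usual.
        yield Request("https://votesmart.org/officials/WA/L/washington-state-legislative",
                      callback=self.parse_officials)

    def parse_officials(self, response):
        # continue with the original scraping logic from the question
        pass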
The expression is working for me in the shell perfectly as is:
$ scrapy shell "http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs"
In [1]: response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
Out[1]: [u'Norma Smith']
Try using the add_xpath() method instead:
loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')

Not able to scrape more than 10 records using Scrapy

I'm new to Scrapy and Python, and I'm using Scrapy to scrape the data.
The site uses AJAX for pagination, so I can't get more than 10 records. I'm posting my code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy import Request
from justdial.items import JustdialItem
import csv
from itertools import izip
import scrapy
import re

class JustdialSpider(Spider):
    name = "JustdialSpider"
    allowed_domains = ["justdial.com"]
    start_urls = [
        "http://www.justdial.com/Mumbai/Dentists/ct-385543",
    ]

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]')
        for question in questions:
            item = JustdialItem()
            item['name'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()').extract()
            item['contact'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/b/text()').extract()
            with open('some.csv', 'wb') as f:
                writer = csv.writer(f)
                writer.writerows(izip(item['name'], item['contact']))
                f.close()
            return item

        # running the code above, I'm able to get 10 records from the page
        # this code is not working for getting more than 10 records; pagination uses AJAX
        url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate='
        next_page = int(re.findall('page=(\d+)', url)[0]) + 1
        next_url = re.sub('page=\d+', 'page={}'.format(next_page), url)
        print next_url

    def parse_ajaxurl(self, response):
        # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
        my_headers = {'Referer': response.url}
        yield Request("ajax_request_url",
                      headers=my_headers,
                      callback=self.parse_ajax)
Please help me
Thanks.
Actually, if you disable JavaScript when viewing the page, you'll notice that the site offers traditional pagination instead of the "never ending" AJAX one.
Using this, you can simply find the URL of the next page and continue:
def parse(self, response):
    questions = response.xpath('//div[contains(@class,"store-details")]')
    for question in questions:
        item = dict()
        item['name'] = question.xpath("h4/span/a/text()").extract_first()
        item['contact'] = question.xpath("p[@class='contact-info']//b/text()").extract_first()
        yield item

    # next page
    next_page = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_page:
        yield Request(next_page)
I also fixed up your XPaths, but overall the only bit that changed is the three lines under the # next page comment.
As a side note, I've noticed you are saving to CSV inside the spider, whereas you could use Scrapy's built-in exporter from the command line, e.g.:
scrapy crawl myspider --output results.csv
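If you prefer to keep the output configuration in the spider itself, newer Scrapy versions (2.1+) also allow configuring feed exports through settings; a minimal sketch, with the file name and format as examples:

class JustdialSpider(Spider):
    name = "JustdialSpider"
    # Feed export configured per spider (Scrapy 2.1+); path and format are examples
    custom_settings = {
        "FEEDS": {
            "results.csv": {"format": "csv"},
        },
    }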

Access session cookie in scrapy spiders

I am trying to access the session cookie within a spider. I first log in to a social network in the spider:
def parse(self, response):
    return [FormRequest.from_response(response,
                                      formname='login_form',
                                      formdata={'email': '...', 'pass': '...'},
                                      callback=self.after_login)]
In after_login, I would like to access the session cookies in order to pass them to another module (Selenium here) to further process the page with an authenticated session.
I would like something like this:
def after_login(self, response):
    # process response
    .....
    # access the cookies of that session to access another URL in the
    # same domain with the authenticated session.
    # Something like:
    session_cookies = XXX.get_session_cookies()
    data = another_function(url, cookies)
Unfortunately, response.cookies does not return the session cookies.
How can I get the session cookies? I was looking at the cookies middleware: scrapy.contrib.downloadermiddleware.cookies and scrapy.http.cookies, but there doesn't seem to be any straightforward way to access the session cookies.
Some more details about my original question:
Unfortunately, I used your idea but I didn't see the cookies, although I know for sure that they exist, since the scrapy.contrib.downloadermiddleware.cookies middleware does print out the cookies! These are exactly the cookies that I want to grab.
So here is what I am doing:
The after_login(self, response) method receives the response variable after proper authentication, and then I access a URL with the session data:
def after_login(self, response):
    # testing to see if I can get the session cookies
    cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
    cookieJar.extract_cookies(response, response.request)
    cookies_test = cookieJar._cookies
    print "cookies - test:", cookies_test
    # URL access with authenticated session
    url = "http://site.org/?id=XXXX"
    request = Request(url=url, callback=self.get_pict)
    return [request]
As the output below shows, there are indeed cookies, but I fail to capture them with cookieJar:
cookies - test: {}
2012-01-02 22:44:39-0800 [myspider] DEBUG: Sending cookies to: <GET http://www.facebook.com/profile.php?id=529907453>
Cookie: xxx=3..........; yyy=34.............; zzz=.................; uuu=44..........
So I would like to get a dictionary containing the keys xxx, yyy etc with the corresponding values.
Thanks :)
A classic example is having a login server, which provides a new session id after a successful login. This new session id should be used with another request.
Here is the code picked up from source which seems to work for me.
print 'cookie from login', response.headers.getlist('Set-Cookie')[0].split(";")[0].split("=")[1]
Code:
def check_logged(self, response):
    tmpCookie = response.headers.getlist('Set-Cookie')[0].split(";")[0].split("=")[1]
    print 'cookie from login', response.headers.getlist('Set-Cookie')[0].split(";")[0].split("=")[1]
    cookieHolder = dict(SESSION_ID=tmpCookie)
    #print response.body
    if "my name" in response.body:
        yield Request(url="<<new url for another server>>",
                      cookies=cookieHolder,
                      callback=self."<<another function here>>")
    else:
        print "login failed"
        return
Maybe this is overkill, but I don't know how you are going to use those cookies, so it might be useful (an excerpt from real code; adapt it to your case):
from scrapy.http.cookies import CookieJar

class MySpider(BaseSpider):
    def parse(self, response):
        cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
        cookieJar.extract_cookies(response, response.request)
        request = Request(nextPageLink, callback=self.parse2,
                          meta={'dont_merge_cookies': True, 'cookie_jar': cookieJar})
        cookieJar.add_cookie_header(request)  # apply Set-Cookie ourselves
CookieJar has some useful methods.
If you still don't see the cookies - maybe they are not there?
UPDATE:
Looking at CookiesMiddleware code:
class CookiesMiddleware(object):
def _debug_cookie(self, request, spider):
if self.debug:
cl = request.headers.getlist('Cookie')
if cl:
msg = "Sending cookies to: %s" % request + os.linesep
msg += os.linesep.join("Cookie: %s" % c for c in cl)
log.msg(msg, spider=spider, level=log.DEBUG)
So, try request.headers.getlist('Cookie')
This works for me
response.request.headers.get('Cookie')
It seems to return all the cookies that were introduced by the middleware in the request, session cookies or otherwise.
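Since that header comes back as a single raw string like "xxx=3...; yyy=34...", turning it into the dictionary the question asks for could look like the sketch below (assuming no cookie values themselves contain ';' or '='):

raw = response.request.headers.get('Cookie')
cookies = {}
if raw:
    # Headers are bytes in recent Scrapy versions; decode before splitting
    for pair in raw.decode('utf-8').split('; '):
        name, _, value = pair.partition('=')
        cookies[name] = value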
As of 2021 (Scrapy 2.5.1), this is still not particularly straightforward. But you can access downloader middlewares (like CookiesMiddleware) from within a spider via self.crawler.engine.downloader:
def after_login(self, response):
    downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
    cookies_mw = next(iter(mw for mw in downloader_middlewares if isinstance(mw, CookiesMiddleware)))
    jar = cookies_mw.jars[response.meta.get('cookiejar')].jar
    cookies_list = [vars(cookie) for domain in jar._cookies.values() for path in domain.values() for cookie in path.values()]
    # or
    cookies_dict = {cookie.name: cookie.value for domain in jar._cookies.values() for path in domain.values() for cookie in path.values()}
    ...
Both output formats above can be passed to other requests using the cookies parameter.
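For example (the target URL and callback name here are placeholders):

yield scrapy.Request(
    "https://example.org/protected-page",  # placeholder URL
    cookies=cookies_dict,
    callback=self.parse_protected,  # hypothetical callback
)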
