Scrapy works in shell but spider returns empty csv

I am learning Scrapy. I am just trying to scrape items, but when I call the spider:
planefinder]# scrapy crawl planefinder -o /User/spider/planefinder/pf.csv -t csv
it only shows technical information and no scraped content (Crawled 0 pages, etc.), and it returns an empty CSV file.
The problem is that when I test the XPath in the Scrapy shell, it works:
>>> from scrapy.selector import Selector
>>> sel = Selector(response)
>>> flights = sel.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
>>> items = []
>>> for flt in flights:
...     item = flt.xpath("td[1]/a/@href").extract_first()
...     items.append(item)
...
>>> items
The following is my planeFinder.py code:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector, HtmlXPathSelector
from planefinder.items import arr_flt_Item, dep_flt_Item

class planefinder(CrawlSpider):
    name = 'planefinder'
    host = 'https://planefinder.net'
    start_url = ['https://planefinder.net/data/airport/PEK/']

    def parse(self, response):
        arr_flights = response.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
        dep_flights = response.xpath("//div[@class='col-md-12'][2]/div/div/table//tr")
        for flight in arr_flights:
            arr_item = arr_flt_Item()
            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arr_item['arr_flt_No'] = flight.xpath('td[1]/a/text()').extract_first()
            arr_item['STA'] = flight.xpath('td[2]/text()').extract_first()
            arr_item['From'] = flight.xpath('td[3]/a/text()').extract_first()
            arr_item['ETA'] = flight.xpath('td[4]/text()').extract_first()
            yield arr_item

Before going to CrawlSpider, please check the docs for Spiders. Some of the issues I've found were:
Instead of host use allowed_domains
Instead of start_url use start_urls
It seems that the page needs to have some cookies set, or maybe it's using some kind of basic anti-bot protection, and you need to land somewhere else first.
Try this (I've also changed the code a bit):
# -*- coding: utf-8 -*-
from scrapy import Field, Item, Request
from scrapy.spiders import CrawlSpider, Spider

class ArrivalFlightItem(Item):
    arr_flt_no = Field()
    arr_sta = Field()
    arr_from = Field()
    arr_eta = Field()

class PlaneFinder(Spider):
    name = 'planefinder'
    allowed_domains = ['planefinder.net']
    start_urls = ['https://planefinder.net/data/airports']

    def parse(self, response):
        yield Request('https://planefinder.net/data/airport/PEK', callback=self.parse_flight)

    def parse_flight(self, response):
        flights_xpath = ('//*[contains(@class, "departure-board") and '
                         './preceding-sibling::h2[contains(., "Arrivals")]]'
                         '//tr[not(./th) and not(./td[@class="spacer"])]')
        for flight in response.xpath(flights_xpath):
            arrival = ArrivalFlightItem()
            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arrival['arr_flt_no'] = flight.xpath('td[1]/a/text()').extract_first()
            arrival['arr_sta'] = flight.xpath('td[2]/text()').extract_first()
            arrival['arr_from'] = flight.xpath('td[3]/a/text()').extract_first()
            arrival['arr_eta'] = flight.xpath('td[4]/text()').extract_first()
            yield arrival

The problem here is not understanding correctly which "Spider" to use, as Scrapy offers different custom ones.
The main one, and the one you should be using here, is the plain Spider rather than CrawlSpider, because CrawlSpider is meant for deeper, more intensive crawls of forums, blogs, etc.
Just change the type of spider to:
from scrapy import Spider

class planefinder(Spider):
    ...

Check the value of ROBOTSTXT_OBEY in your settings.py file. By default it's set to True (but not when you run the shell). Set it to False if you want to ignore the robots.txt file.
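For reference, this is a minimal sketch of the relevant line in a project's settings.py:
# settings.py
ROBOTSTXT_OBEY = False  # do not honour robots.txt for this project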

Related

How do I get the follow-URL capability to work in my code?

I am attempting to add the follow-URL capability but can't seem to get it to work. I need to crawl all the pages; there are around 108 pages of job listings. Thank you.
import scrapy

class JobItem(scrapy.Item):
    # Data structure to store the title, company name and location of the job
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()

class PythonDocumentationSpider(scrapy.Spider):
    name = 'pydoc'
    start_urls = ['https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab']

    def parse(self, response):
        for follow_href in response.xpath('//h2[@class="fs-body2 job-details__spaced mb4"]/a/@href'):
            follow_url = response.urljoin(follow_href.extract())
            yield scrapy.Request(follow_url, callback=self.parse_page_title)

        for a_el in response.xpath('//div[@class="-job-summary"]'):
            section = JobItem()
            section['title'] = a_el.xpath('.//a[@class="s-link s-link__visited job-link"]/text()').extract()[0]
            span_texts = a_el.xpath('.//div[@class="fc-black-700 fs-body1 -company"]/span/text()').extract()
            section['company'] = span_texts[0]
            section['location'] = span_texts[1]
            print(section['location'])
            #print(type(section))
            yield section
I am attempting to get the follow-URL capability to work with my code, so that I can crawl the pages and store the job postings in a CSV file.
.extract() returns a list. In most cases you'll want to use .get() or .extract_first() instead if you don't need a list.
First you need to rewrite this part:
for follow_href in response.xpath('//h2[@class="fs-body2 job-details__spaced mb4"]/a/@href').getall():  # or .extract()
    follow_url = response.urljoin(follow_href)
    yield scrapy.Request(follow_url, callback=self.parse_page_title)
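To illustrate the difference between the selector methods, here is a minimal sketch using a made-up HTML snippet (the markup is only for illustration):
from scrapy.selector import Selector

sel = Selector(text='<a href="/jobs/1">first</a><a href="/jobs/2">second</a>')

sel.xpath('//a/@href').extract()        # ['/jobs/1', '/jobs/2'] -- list of all matches
sel.xpath('//a/@href').getall()         # same as .extract()
sel.xpath('//a/@href').extract_first()  # '/jobs/1' -- only the first match, or None
sel.xpath('//a/@href').get()            # same as .extract_first()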

Possible to replace Scrapy's default lxml parser with Beautiful Soup's html5lib parser?

Question: Is there a way to integrate BeautifulSoup's html5lib parser into a Scrapy project, instead of Scrapy's default lxml parser?
Scrapy's parser fails on some elements of my scraped pages.
This only happens for about 2 out of every 20 pages.
As a fix, I've added BeautifulSoup's parser to the project (which works).
That said, I feel like I'm doubling the work with conditionals and multiple parsers... at a certain point, what's the reason for using Scrapy's parser at all? The code does work... it just feels like a hack.
I'm no expert: is there a more elegant way to do this?
Much appreciation in advance
Update: Adding a middleware class to Scrapy (from the Python package scrapy-beautifulsoup) works like a charm. Apparently, lxml as used by Scrapy is not as robust as lxml used via BeautifulSoup. I didn't have to resort to the html5lib parser, which is 30x+ slower.
class BeautifulSoupMiddleware(object):
    def __init__(self, crawler):
        super(BeautifulSoupMiddleware, self).__init__()
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', "html.parser")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        """Overridden process_response would "pipe" response.body through BeautifulSoup."""
        return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
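For context, a downloader middleware like this still has to be enabled in settings.py. The module path and priority below are assumptions for illustration, not taken from the scrapy-beautifulsoup documentation:
# settings.py -- assumed module path for the middleware class above
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BeautifulSoupMiddleware': 543,
}
BEAUTIFULSOUP_PARSER = 'lxml'  # or 'html.parser' / 'html5lib'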
Original:
import scrapy
from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy import Selector
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from bs4 import BeautifulSoup

class SimpleSpider(scrapy.Spider):
    name = 'SimpleSpider'
    allowed_domains = ['totally-above-board.com']
    start_urls = [
        'https://totally-above-board.com/nefarious-scrape-page.html'
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawler.spiders.simple_spider.Pipeline': 400
        }
    }

    def parse(self, response):
        yield from self.parse_company_info(response)
        yield from self.parse_reviews(response)

    def parse_company_info(self, response):
        print('parse_company_info')
        print('==================')
        loader = ItemLoader(CompanyItem(), response=response)
        loader.add_xpath('company_name',
                         '//h1[contains(@class,"sp-company-name")]//span//text()')
        yield loader.load_item()

    def parse_reviews(self, response):
        print('parse_reviews')
        print('=============')

        # Beautiful Soup
        selector = Selector(response)

        # On the Page (Total Reviews) # 49
        search = '//span[contains(@itemprop,"reviewCount")]//text()'
        review_count = selector.xpath(search).get()
        review_count = int(float(review_count))

        # Number of elements Scrapy's LXML Could find # 0
        search = '//div[@itemprop ="review"]'
        review_element_count = len(selector.xpath(search))

        # Use Scrapy or Beautiful Soup?
        if review_count > review_element_count:
            # Try Beautiful Soup
            soup = BeautifulSoup(response.text, "lxml")
            root = soup.findAll("div", {"itemprop": "review"})
            for review in root:
                loader = ItemLoader(ReviewItem(), selector=review)
                review_text = review.find("span", {"itemprop": "reviewBody"}).text
                loader.add_value('review_text', review_text)
                author = review.find("span", {"itemprop": "author"}).text
                loader.add_value('author', author)
                yield loader.load_item()
        else:
            # Try Scrapy
            review_list_xpath = '//div[@itemprop ="review"]'
            selector = Selector(response)
            for review in selector.xpath(review_list_xpath):
                loader = ItemLoader(ReviewItem(), selector=review)
                loader.add_xpath('review_text',
                                 './/span[@itemprop="reviewBody"]//text()')
                loader.add_xpath('author',
                                 './/span[@itemprop="author"]//text()')
                yield loader.load_item()

        yield from self.paginate_reviews(response)

    def paginate_reviews(self, response):
        print('paginate_reviews')
        print('================')

        # Try Scrapy
        selector = Selector(response)
        search = '''//span[contains(@class,"item-next")]
                    //a[@class="next"]/@href
                 '''
        next_reviews_link = selector.xpath(search).get()

        # Try Beautiful Soup
        if next_reviews_link is None:
            soup = BeautifulSoup(response.text, "lxml")
            try:
                next_reviews_link = soup.find("a", {"class": "next"})['href']
            except Exception as e:
                pass

        if next_reviews_link:
            yield response.follow(next_reviews_link, self.parse_reviews)
It’s a common feature request for Parsel, Scrapy’s library for XML/HTML scraping.
However, you don’t need to wait for such a feature to be implemented. You can fix the HTML code using BeautifulSoup, and use Parsel on the fixed HTML:
from bs4 import BeautifulSoup
# …
response = response.replace(body=str(BeautifulSoup(response.body, "html5lib")))
You can get a charset error using @Gallaecio's answer if the original page was not UTF-8 encoded, because the response is set to another encoding.
So, you must first switch the encoding.
In addition, there may be a problem of character escaping.
For example, if the character < is encountered in the text of the HTML, then it must be escaped as &lt;. Otherwise, "lxml" will delete it and the text near it, considering it an erroneous HTML tag.
"html5lib" escapes characters, but is slow.
response = response.replace(encoding='utf-8',
                            body=str(BeautifulSoup(response.body, 'html5lib')))
"html.parser" is faster, but from_encoding must also be specified (for example, 'cp1251').
response = response.replace(encoding='utf-8',
                            body=str(BeautifulSoup(response.body, 'html.parser', from_encoding='cp1251')))

Not able to scrape more than 10 records using Scrapy

I'm new to Scrapy and Python. I'm using Scrapy for scraping the data.
The site uses AJAX for pagination, so I'm not able to get more than 10 records. I'm posting my code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy import Request
from justdial.items import JustdialItem
import csv
from itertools import izip
import scrapy
import re

class JustdialSpider(Spider):
    name = "JustdialSpider"
    allowed_domains = ["justdial.com"]
    start_urls = [
        "http://www.justdial.com/Mumbai/Dentists/ct-385543",
    ]

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]')
        for question in questions:
            item = JustdialItem()
            item['name'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()').extract()
            item['contact'] = question.xpath(
                '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/b/text()').extract()
            with open('some.csv', 'wb') as f:
                writer = csv.writer(f)
                writer.writerows(izip(item['name'], item['contact']))
            f.close()
            return item

        # if running code above this I'm able to get 10 records of the page
        # This code not working for getting data more than 10 records, Pagination using AJAX
        url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate='
        next_page = int(re.findall('page=(\d+)', url)[0]) + 1
        next_url = re.sub('page=\d+', 'page={}'.format(next_page), url)
        print next_url

    def parse_ajaxurl(self, response):
        # e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
        my_headers = {'Referer': response.url}
        yield Request("ajax_request_url",
                      headers=my_headers,
                      callback=self.parse_ajax)
Please help me
Thanks.
Actually, if you disable JavaScript when viewing the page, you'll notice that the site offers traditional pagination instead of the "never-ending" AJAX one.
Using this, you can simply find the URL of the next page and continue:
def parse(self, response):
    questions = response.xpath('//div[contains(@class,"store-details")]')
    for question in questions:
        item = dict()
        item['name'] = question.xpath("h4/span/a/text()").extract_first()
        item['contact'] = question.xpath("p[@class='contact-info']//b/text()").extract_first()
        yield item

    # next page
    next_page = response.xpath("//a[@rel='next']/@href").extract_first()
    if next_page:
        yield Request(next_page)
I also fixed up your XPaths, but overall the only bit that changed is the three lines under the # next page comment.
As a side note, I've noticed you are saving to CSV in the spider, where you could instead use the built-in Scrapy exporter command like:
scrapy crawl myspider --output results.csv
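If you prefer to keep the export configuration in the project rather than on the command line, newer Scrapy versions (2.1+) support a FEEDS setting; a minimal sketch:
# settings.py (Scrapy 2.1+) -- equivalent to passing --output results.csv
FEEDS = {
    'results.csv': {'format': 'csv'},
}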

How can I change the scrapy download image name in pipelines?

from __future__ import unicode_literals
import sys
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
import os

reload(sys)
sys.setdefaultencoding('utf-8')

class TetePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        item['image'] = []
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Items contains no images')
        item['image_paths'] = image_paths
        for i in item['image_paths']:
            item['image'].append(item['image_titles']+i[-8:])
        item['image_paths'] = item['image']
        return item
Scrapy version: 1.0
This is my code. It can download images, but the image names are the SHA1 hash of the image URL.
I want to give the image a custom name; in this example it is item['image_titles']+i[-8:]. In the Scrapy shell, item['image_titles']+i[-8:] outputs normally, so what is the reason it doesn't work here?
class TetePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        image_name = item['image_titles']+image_guid[-8:]
        return image_name
Change the file_path method and return the image_name there, because get_media_requests is what downloads the image; by the time item_completed runs, the image has already been downloaded and named.
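For completeness, a pipeline like this only runs if it is enabled and an image store is configured; the module path and directory below are assumptions for illustration:
# settings.py -- assumed module path and storage directory
ITEM_PIPELINES = {
    'myproject.pipelines.TetePipeline': 1,
}
IMAGES_STORE = '/path/to/images'  # files are saved here under the name returned by file_path()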

Scrapy restrict_xpath syntax error

I'm trying to limit Scrapy to a particular XPath location for following links. The XPath is correct (according to the XPath Helper plugin for Chrome), but when I run my CrawlSpider I get a syntax error at my Rule.
My Spider code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126"]

    rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*']), callback='parse_item', follow=True, restrict_xpaths=('//a[starts-with(@title,"Next ")]')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items
So inside the rule, the XPath '//a[starts-with(@title,"Next ")]' is returning an error and I'm not sure why, since the actual XPath is valid. I'm simply trying to get the spider to crawl each "Next Page" link. Can anyone help me out? Please let me know if you need any other parts of my code.
It's not the XPath that is the issue; rather, the syntax of the complete rule is incorrect. The following rule fixes the syntax error, but it should be checked to make sure that it is doing what is required:
rules = (Rule(SgmlLinkExtractor(allow=['/f126/index*'], restrict_xpaths=('//a[starts-with(@title,"Next ")]')),
              callback='parse_item', follow=True, ),
        )
As a general point, posting the actual error in a question is highly recommended since the perception of the error and the actual error may well differ.
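As an aside, SgmlLinkExtractor and the scrapy.contrib import paths used above have since been removed from Scrapy; in current versions the same rule would look roughly like this (a sketch, not tested against the site):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

rules = (
    Rule(LinkExtractor(allow=['/f126/index*'],
                       restrict_xpaths='//a[starts-with(@title,"Next ")]'),
         callback='parse_item', follow=True),
)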
