How to use XPath to find a text node

I'm using Scrapy to get user information from Stack Overflow, and I'm trying to use //h2[@class="user-card-name"]/text()[1] to get the name. However, I get this:
['\n Ignacio Vazquez-Abrams\n \n
Can someone please help?

You should be able to clean up the surrounding whitespace from the result easily using Python's strip() method:
In [2]: result = response.xpath('//h2[@class="user-card-name"]/text()[1]').extract()
In [3]: [r.strip() for r in result]
Out[3]: [u'Ignacio Vazquez-Abrams']
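Alternatively, a small sketch (not part of the original answer) that lets XPath itself do the trimming via normalize-space(), assuming the same response object as above:

# sketch: normalize-space() trims and collapses whitespace inside XPath itself
name = response.xpath(
    'normalize-space(//h2[@class="user-card-name"]/text()[1])'
).extract_first()
# name == u'Ignacio Vazquez-Abrams'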

The recommended way to crawl unstructured data with Scrapy is to use ItemLoaders, and scrapylib offers very good defaults in its default_input_processor and default_output_processor.
items.py
from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapylib.processors import default_input_processor
from scrapylib.processors import default_output_processor

class MyItem(Item):
    field1 = Field()
    field2 = Field()

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    default_input_processor = default_input_processor
    default_output_processor = default_output_processor
Now, in your spider code, populate your items with:
from myproject.items import MyItemLoader
...
... # on your callback
loader = MyItemLoader(response=response)
loader.add_xpath('field1', '//h2[@class="user-card-name"]/text()[1]')
... keep populating the loader
yield loader.load_item() # to return an item

Try this:
result = response.xpath('//h2[@class="user-card-name"]/text()').extract()
result = result[0].strip() if result else ''

Related

Flask-WTF validators Length min and max do not work

I built a Flask app with an sqlite3 database and two routes, add and save.
I have a problem with the validators: some of them work and some do not.
validators.DataRequired() works
URLField() works
but validators.Length(min=1, max=15) does not work at all
from flask_wtf import FlaskForm  # I also tried with Form
from wtforms import BooleanField, StringField, IntegerField, validators, SubmitField
from wtforms.fields.html5 import URLField

class AddRecValidators(FlaskForm):  # <-- I also tried with Form
    title = StringField('Title:', [validators.DataRequired(), validators.Length(min=1, max=35, message="Title too long max 35 characters")])
    authors = StringField('Authors:', [validators.Length(min=1, max=100)])
    published_date = IntegerField('Published date:', [validators.Length(min=1, max=4)])
    isbn_or_identifier = StringField('ISBN:', [validators.Length(min=1, max=15)])
    page_count = IntegerField('Page count:', [validators.Length(min=1, max=10000)])
    language = StringField('Language:', [validators.Length(min=1, max=3)])
    image_links = URLField('Image links:')
    submit = SubmitField(label=('Add to library'))
It looks like you're using the wrong validators for the type of input you're validating.
validators.Length() is for strings, see here
For the integers, try using NumberRange
from flask_wtf import FlaskForm
from wtforms import BooleanField, StringField, IntegerField, validators, SubmitField
from wtforms.fields.html5 import URLField

class AddRecValidators(FlaskForm):
    title = StringField('Title:', [validators.DataRequired(), validators.Length(min=1, max=35, message="Title too long max 35 characters")])
    authors = StringField('Authors:', [validators.Length(min=1, max=100)])
    published_date = IntegerField('Published date:', [validators.NumberRange(min=1, max=4)])  # <-- note change to NumberRange
    isbn_or_identifier = StringField('ISBN:', [validators.Length(min=1, max=15)])
    page_count = IntegerField('Page count:', [validators.NumberRange(min=1, max=10000)])  # <-- note change to NumberRange
    language = StringField('Language:', [validators.Length(min=1, max=3)])
    image_links = URLField('Image links:')
    submit = SubmitField(label=('Add to library'))
Also, here are the docs for flask-wtforms validators.
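For completeness, a minimal sketch of how the corrected form might be wired into a view. The module name forms and the template name add.html are assumptions, not part of the original question:

from flask import Flask, render_template, redirect, url_for
from forms import AddRecValidators  # assumed module holding the form class above

app = Flask(__name__)
app.config['SECRET_KEY'] = 'dev'  # Flask-WTF needs a secret key for CSRF

@app.route('/add', methods=['GET', 'POST'])
def add():
    form = AddRecValidators()
    if form.validate_on_submit():
        # every validator (DataRequired, Length, NumberRange) passed
        return redirect(url_for('add'))
    # on failure, form.errors maps field names to their validation messages
    return render_template('add.html', form=form)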

Possible to replace Scrapy's default lxml parser with Beautiful Soup's html5lib parser?

Question: Is there a way to integrate BeautifulSoup's html5lib parser into a Scrapy project, instead of Scrapy's default lxml parser?
Scrapy's parser fails on some elements of my scraped pages.
This only happens on about 2 out of every 20 pages.
As a fix, I've added BeautifulSoup's parser to the project (which works).
That said, I feel like I'm doubling the work with conditionals and multiple parsers... at a certain point, what's the reason for using Scrapy's parser? The code does work... but it feels like a hack.
I'm no expert--is there a more elegant way to do this?
Much appreciation in advance
Update: Adding a middleware class to Scrapy (from the Python package scrapy-beautifulsoup) works like a charm. Apparently, lxml as used by Scrapy is not as robust as BeautifulSoup's lxml. I didn't have to resort to the html5lib parser, which is 30x+ slower.
class BeautifulSoupMiddleware(object):
    def __init__(self, crawler):
        super(BeautifulSoupMiddleware, self).__init__()
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', "html.parser")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        """Overridden process_response would "pipe" response.body through BeautifulSoup."""
        return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
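For this middleware to actually run, it has to be enabled in the project settings. A sketch, assuming the class above lives in myproject/middlewares.py (the module path and priority value are placeholders):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.BeautifulSoupMiddleware': 543,
}
BEAUTIFULSOUP_PARSER = 'html5lib'  # or 'html.parser' / 'lxml'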
Original:
import scrapy
from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy import Selector
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from bs4 import BeautifulSoup


class SimpleSpider(scrapy.Spider):
    name = 'SimpleSpider'
    allowed_domains = ['totally-above-board.com']
    start_urls = [
        'https://totally-above-board.com/nefarious-scrape-page.html'
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawler.spiders.simple_spider.Pipeline': 400
        }
    }

    def parse(self, response):
        yield from self.parse_company_info(response)
        yield from self.parse_reviews(response)

    def parse_company_info(self, response):
        print('parse_company_info')
        print('==================')
        loader = ItemLoader(CompanyItem(), response=response)
        loader.add_xpath('company_name',
                         '//h1[contains(@class,"sp-company-name")]//span//text()')
        yield loader.load_item()

    def parse_reviews(self, response):
        print('parse_reviews')
        print('=============')

        # Beautiful Soup
        selector = Selector(response)

        # On the Page (Total Reviews) # 49
        search = '//span[contains(@itemprop,"reviewCount")]//text()'
        review_count = selector.xpath(search).get()
        review_count = int(float(review_count))

        # Number of elements Scrapy's LXML Could find # 0
        search = '//div[@itemprop ="review"]'
        review_element_count = len(selector.xpath(search))

        # Use Scrapy or Beautiful Soup?
        if review_count > review_element_count:
            # Try Beautiful Soup
            soup = BeautifulSoup(response.text, "lxml")
            root = soup.findAll("div", {"itemprop": "review"})
            for review in root:
                loader = ItemLoader(ReviewItem(), selector=review)
                review_text = review.find("span", {"itemprop": "reviewBody"}).text
                loader.add_value('review_text', review_text)
                author = review.find("span", {"itemprop": "author"}).text
                loader.add_value('author', author)
                yield loader.load_item()
        else:
            # Try Scrapy
            review_list_xpath = '//div[@itemprop ="review"]'
            selector = Selector(response)
            for review in selector.xpath(review_list_xpath):
                loader = ItemLoader(ReviewItem(), selector=review)
                loader.add_xpath('review_text',
                                 './/span[@itemprop="reviewBody"]//text()')
                loader.add_xpath('author',
                                 './/span[@itemprop="author"]//text()')
                yield loader.load_item()

        yield from self.paginate_reviews(response)

    def paginate_reviews(self, response):
        print('paginate_reviews')
        print('================')

        # Try Scrapy
        selector = Selector(response)
        search = '''//span[contains(@class,"item-next")]
                    //a[@class="next"]/@href
                 '''
        next_reviews_link = selector.xpath(search).get()

        # Try Beautiful Soup
        if next_reviews_link is None:
            soup = BeautifulSoup(response.text, "lxml")
            try:
                next_reviews_link = soup.find("a", {"class": "next"})['href']
            except Exception as e:
                pass

        if next_reviews_link:
            yield response.follow(next_reviews_link, self.parse_reviews)
It’s a common feature request for Parsel, Scrapy’s library for XML/HTML scraping.
However, you don’t need to wait for such a feature to be implemented. You can fix the HTML code using BeautifulSoup, and use Parsel on the fixed HTML:
from bs4 import BeautifulSoup
# …
response = response.replace(body=str(BeautifulSoup(response.body, "html5lib")))
You can get a charset error using @Gallaecio's answer if the original page was not UTF-8 encoded, because the response is set to a different encoding.
So you must first switch the encoding.
In addition, there may be a problem with character escaping.
For example, if the character < is encountered in the text of the HTML, then it must be escaped as &lt;. Otherwise, "lxml" will delete it and the text near it, considering it an erroneous HTML tag.
"html5lib" escapes characters, but is slow.
response = response.replace(encoding='utf-8',
                            body=str(BeautifulSoup(response.body, 'html5lib')))
"html.parser" is faster, but from_encoding must also be specified (for example, 'cp1251').
response = response.replace(encoding='utf-8',
                            body=str(BeautifulSoup(response.body, 'html.parser', from_encoding='cp1251')))

Scrapy works in shell but spider returns empty csv

I am learning Scrapy. Now I am just trying to scrape items, and when I call my spider:
planefinder]# scrapy crawl planefinder -o /User/spider/planefinder/pf.csv -t csv
it shows technical information and no scraped content (Crawled 0 pages ... etc.), and it returns an empty CSV file.
The problem is that when I test the XPath in the Scrapy shell, it works:
>>> from scrapy.selector import Selector
>>> sel = Selector(response)
>>> flights = sel.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
>>> items = []
>>> for flt in flights:
... item = flt.xpath("td[1]/a/@href").extract_first()
... items.append(item)
...
>>> items
The following is my planeFinder.py code:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector, HtmlXPathSelector
from planefinder.items import arr_flt_Item, dep_flt_Item


class planefinder(CrawlSpider):
    name = 'planefinder'
    host = 'https://planefinder.net'
    start_url = ['https://planefinder.net/data/airport/PEK/']

    def parse(self, response):
        arr_flights = response.xpath("//div[@class='col-md-12'][1]/div/div/table//tr")
        dep_flights = response.xpath("//div[@class='col-md-12'][2]/div/div/table//tr")
        for flight in arr_flights:
            arr_item = arr_flt_Item()
            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arr_item['arr_flt_No'] = flight.xpath('td[1]/a/text()').extract_first()
            arr_item['STA'] = flight.xpath('td[2]/text()').extract_first()
            arr_item['From'] = flight.xpath('td[3]/a/text()').extract_first()
            arr_item['ETA'] = flight.xpath('td[4]/text()').extract_first()
            yield arr_item
Before going to CrawlSpider, please check the docs for Spiders. Some of the issues I've found were:
Instead of host use allowed_domains
Instead of start_url use start_urls
It seems that the page needs to have some cookies set, or maybe it's using some kind of basic anti-bot protection, and you need to land somewhere else first.
Try this (I've also changed things a bit):
# -*- coding: utf-8 -*-
from scrapy import Field, Item, Request
from scrapy.spiders import CrawlSpider, Spider


class ArrivalFlightItem(Item):
    arr_flt_no = Field()
    arr_sta = Field()
    arr_from = Field()
    arr_eta = Field()


class PlaneFinder(Spider):
    name = 'planefinder'
    allowed_domains = ['planefinder.net']
    start_urls = ['https://planefinder.net/data/airports']

    def parse(self, response):
        yield Request('https://planefinder.net/data/airport/PEK', callback=self.parse_flight)

    def parse_flight(self, response):
        flights_xpath = ('//*[contains(@class, "departure-board") and '
                         './preceding-sibling::h2[contains(., "Arrivals")]]'
                         '//tr[not(./th) and not(./td[@class="spacer"])]')
        for flight in response.xpath(flights_xpath):
            arrival = ArrivalFlightItem()
            arr_flt_url = flight.xpath('td[1]/a/@href').extract_first()
            arrival['arr_flt_no'] = flight.xpath('td[1]/a/text()').extract_first()
            arrival['arr_sta'] = flight.xpath('td[2]/text()').extract_first()
            arrival['arr_from'] = flight.xpath('td[3]/a/text()').extract_first()
            arrival['arr_eta'] = flight.xpath('td[4]/text()').extract_first()
            yield arrival
The problem here is not understanding correctly which "Spider" to use, as Scrapy offers different custom ones.
The main one, and the one you should be using, is the simple Spider and not CrawlSpider, because CrawlSpider is meant for a deeper and more intensive crawl of forums, blogs, etc.
Just change the type of spider to:
from scrapy import Spider

class PlaneFinder(Spider):
    ...
Check the value of ROBOTSTXT_OBEY in your settings.py file. By default it's set to True (but not when you run the shell). Set it to False if you want to disobey the robots.txt file.
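For example, a minimal settings.py sketch:

# settings.py
ROBOTSTXT_OBEY = False  # skip robots.txt checks for this project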

Xpath selector not working in Scrapy

I'm trying to extract the text matched by this XPath:
//*/li[contains(., "Full Name")]/span/text()
from this webpage:
http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs
I've tested it in Google Chrome's console (where it works), as well as many other variations of the XPath, but I can't get it to work with Scrapy. My code only returns "{}".
Here's where I have been testing it in my code, for context:
def parse_bio(self, response):
    loader = response.meta['loader']
    fullnameValue = response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
    loader.add_value('fullName', fullnameValue)
    return loader.load_item()
The problem isn't my code, I don't think; it works fine with other (very broad) XPath selectors. But I'm not sure what's wrong with this XPath. I have JavaScript disabled, if that makes a difference.
Any help would be great!
Edit: Here is the rest of the code to make it more clear:
from scrapy import Spider, Request, Selector
from votesmart.items import LegislatorsItems, TheLoader


class VSSpider(Spider):
    name = "vs"
    allowed_domains = ["votesmart.org"]
    start_urls = ["https://votesmart.org/officials/WA/L/washington-state-legislative"]

    def parse(self, response):
        for href in response.xpath('//h5/a/@href').extract():
            person_url = response.urljoin(href)
            yield Request(person_url, callback=self.candidatesPoliticalSummary)

    def candidatesPoliticalSummary(self, response):
        item = LegislatorsItems()
        l = TheLoader(item=LegislatorsItems(), response=response)
        ...
        # populating items with item loader. works fine

        # create right bio url and pass item loader to it
        bio_url = response.url.replace('votesmart.org/candidate/',
                                       'votesmart.org/candidate/biography/')
        return Request(bio_url, callback=self.parse_bio, meta={'loader': l})

    def parse_bio(self, response):
        loader = response.meta['loader']
        print response.request.url
        loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')
        return loader.load_item()
I figured out my problem! Many pages on the site were login protected, and I wasn't able to scrape from pages that I couldn't access in the first place. Scrapy's form request did the trick. Thanks for all the help (especially the suggestion of using view(response), which is super helpful).
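For anyone hitting the same wall, a rough sketch of the FormRequest-based login (the login URL, field names, and credentials below are placeholders, not taken from the original code):

import scrapy

class VSLoginSpider(scrapy.Spider):
    name = "vs_login"
    start_urls = ["https://votesmart.org/login/"]  # placeholder login page

    def parse(self, response):
        # from_response() copies the hidden fields of the login form for us;
        # the visible field names here are assumptions -- inspect the real form
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "my_user", "password": "my_pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # once the session cookie is set, crawl the protected pages as usual
        yield scrapy.Request(
            "https://votesmart.org/officials/WA/L/washington-state-legislative",
            callback=self.parse_listing,
        )

    def parse_listing(self, response):
        pass  # normal parsing from here on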
The expression is working for me in the shell perfectly as is:
$ scrapy shell "http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs"
In [1]: response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
Out[1]: [u'Norma Smith']
Try using the add_xpath() method instead:
loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')

Scrapy restrict_xpath syntax error

I'm trying to limit Scrapy to a particular XPath location for following links. The XPath is correct (according to the XPath Helper plugin for Chrome), but when I run my crawl spider I get a syntax error at my Rule.
My Spider code is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BassItem
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()


class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126"]
    rules = [Rule(SgmlLinkExtractor(allow=['/f126/index*']), callback='parse_item', follow=True, restrict_xpaths=('//a[starts-with(@title,"Next ")]')]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        ads = hxs.select('//table[@id="threadslist"]/tbody/tr/td[@class="alt1"][2]/div')
        items = []
        for ad in ads:
            item = BassItem()
            item['title'] = ad.select('a/text()').extract()
            item['link'] = ad.select('a/@href').extract()
            items.append(item)
        return items
So inside the rule, the XPath '//a[starts-with(@title,"Next ")]' is returning an error and I'm not sure why, since the actual XPath is valid. I'm simply trying to get the spider to crawl each "Next Page" link. Can anyone help me out? Please let me know if you need any other parts of my code.
It's not the XPath that is the issue; rather, the syntax of the complete rule is incorrect. The following rule fixes the syntax error, but should be checked to make sure that it is doing what is required:
rules = (Rule(SgmlLinkExtractor(allow=['/f126/index*'], restrict_xpaths=('//a[starts-with(@title,"Next ")]')),
              callback='parse_item', follow=True, ),
         )
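As a side note, on newer Scrapy versions (where SgmlLinkExtractor has been removed) the equivalent rule would use the plain LinkExtractor; a sketch under that assumption, keeping the same allow pattern and restrict_xpaths:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BassSpider(CrawlSpider):
    name = "bass"
    allowed_domains = ["talkbass.com"]
    start_urls = ["http://www.talkbass.com/forum/f126"]

    # same pattern and XPath restriction, just with the modern LinkExtractor
    rules = (
        Rule(LinkExtractor(allow=['/f126/index*'],
                           restrict_xpaths='//a[starts-with(@title,"Next ")]'),
             callback='parse_item', follow=True),
    )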
As a general point, posting the actual error in a question is highly recommended since the perception of the error and the actual error may well differ.
