Errors while scraping pages with json and scrapy - windows

I am scraping producthunt.com using Scrapy with Python3 on Win10. I am confused by my code behaviour, which duplicates some fields in output data.
Can anyone point at the reason of that and suggest a solution?
Code:
http://pastebin.com/VvFGCmDJ
Sample of the output:
http://pastebin.com/ffx0HN54

I haven't run your code but creating the Item instance definitely needs to be inside the for loop:
...
def parse(self, response):
jsonresponse = json.loads(response.body_as_unicode())
topic = jsonresponse['posts']
for post in topic:
service = ProducthuntItem()
service['name'] = post['name'].replace(";", " ")
...

Related

Scrapy Xpath returns empty output

I try to scrapy this data (the 71) from this line of code here:
<span class="text-pill text-pill--steel tooltip tooltipstered" data-options="{"theme": "white"}">71</span>
from the website
https://www.attheraces.com//racecard/Hamilton/28-September-2020/1330
I tried
class ErgebnisseSpider(scrapy.Spider):
name = 'namen'
allowed_domains = ['www.attheraces.com/']
start_urls = ['https://www.attheraces.com//racecard/Hamilton/28-September-2020/1330']
def parse(self, response):
starterkomplett = response.xpath('//div[#class="column width--tablet-wide-18"]')
for rennen2 in starterkomplett:
rating = rennen2.xpath('//span[#class="text-pill text-pill--steel tooltip tooltipstered"]').getall()
yield {
'rennen2_rating' : rating,
}
But I am not getting the "74" rating. Neither with /text() nor without, but with the inspection tool from chrome I can select the span with the aformentioned code... what am I missing here?
With other values on the side the code works, but not for the rating...
I am kind of new and learning, but I did not come any further googleing myself, so I guessed I ask here - again - Sorry for beein such a noob :P
See the page source
response.xpath("//*[#class='text-pill text-pill--steel tooltip']//text()").getall()

How do I enable following URL capability to work in my code?

I am attempting to add the follow url capability but can't seem to get it to work. I need to crawl all the pages. There are around 108 pages of the job listings. Thank you.
import scrapy
class JobItem(scrapy.Item):
# Data structure to store the title, company name and location of the job
title = scrapy.Field()
company = scrapy.Field()
location = scrapy.Field()
class PythonDocumentationSpider(scrapy.Spider):
name = 'pydoc'
start_urls = ['https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab']
def parse(self, response):
for follow_href in response.xpath('//h2[#class="fs-body2 job-details__spaced mb4"]/a/#href'):
follow_url = response.urljoin(follow_href.extract())
yield scrapy.Request(follow_url, callback=self.parse_page_title)
for a_el in response.xpath('//div[#class="-job-summary"]'):
section = JobItem()
section['title'] = a_el.xpath('.//a[#class="s-link s-link__visited job-link"]/text()').extract()[0]
span_texts = a_el.xpath('.//div[#class="fc-black-700 fs-body1 -company"]/span/text()').extract()
section['company'] = span_texts[0]
section['location'] = span_texts[1]
print(section['location'])
#print(type(section))
yield section
I am attempting to get the following url capability to work with my code and then be able to crawl the pages and store job postings in csv file.
.extract() return a list. In most cases you'll need to use .get() or .extract_first() instead if you don't need a list.
First you need to rewrite this part:
for follow_href in response.xpath('//h2[#class="fs-body2 job-details__spaced mb4"]/a/#href').getall(): # or .extract()
follow_url = response.urljoin(follow_href)
yield scrapy.Request(follow_url, callback=self.parse_page_title)

Possible to replace Scrapy's default lxml parser with Beautiful Soup's html5lib parser?

Question: Is there a way to integrate BeautifulSoup's html5lib parser into a scrapy project--instead of the scrapy's default lxml parser?
Scrapy's parser fails (for some elements) of my scrape pages.
This only happens every 2 out of 20 pages.
As a fix, I've added BeautifulSoup's parser to the project (which works).
That said, I feel like I'm doubling the work with conditionals and multiple parsers...at a certain point, what's the reason for using Scrapy's parser? The code does work....it feels like a hack.
I'm no expert--is there a more elegant way to do this?
Much appreciation in advance
Update: Adding a middleware class to scrapy (from the python package scrapy-beautifulsoup) works like a charm. Apparently, lxml from Scrapy is not as robust as BeautifulSoup's lxml. I didn't have to resort to the html5lib parser--which is 30X+ slower.
class BeautifulSoupMiddleware(object):
def __init__(self, crawler):
super(BeautifulSoupMiddleware, self).__init__()
self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', "html.parser")
#classmethod
def from_crawler(cls, crawler):
return cls(crawler)
def process_response(self, request, response, spider):
"""Overridden process_response would "pipe" response.body through BeautifulSoup."""
return response.replace(body=str(BeautifulSoup(response.body, self.parser)))
Original:
import scrapy
from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst, MapCompose
from scrapy import Selector
from scrapy.loader import ItemLoader
from w3lib.html import remove_tags
from bs4 import BeautifulSoup
class SimpleSpider(scrapy.Spider):
name = 'SimpleSpider'
allowed_domains = ['totally-above-board.com']
start_urls = [
'https://totally-above-board.com/nefarious-scrape-page.html'
]
custom_settings = {
'ITEM_PIPELINES': {
'crawler.spiders.simple_spider.Pipeline': 400
}
}
def parse(self, response):
yield from self.parse_company_info(response)
yield from self.parse_reviews(response)
def parse_company_info(self, response):
print('parse_company_info')
print('==================')
loader = ItemLoader(CompanyItem(), response=response)
loader.add_xpath('company_name',
'//h1[contains(#class,"sp-company-name")]//span//text()')
yield loader.load_item()
def parse_reviews(self, response):
print('parse_reviews')
print('=============')
# Beautiful Soup
selector = Selector(response)
# On the Page (Total Reviews) # 49
search = '//span[contains(#itemprop,"reviewCount")]//text()'
review_count = selector.xpath(search).get()
review_count = int(float(review_count))
# Number of elements Scrapy's LXML Could find # 0
search = '//div[#itemprop ="review"]'
review_element_count = len(selector.xpath(search))
# Use Scrapy or Beautiful Soup?
if review_count > review_element_count:
# Try Beautiful Soup
soup = BeautifulSoup(response.text, "lxml")
root = soup.findAll("div", {"itemprop": "review"})
for review in root:
loader = ItemLoader(ReviewItem(), selector=review)
review_text = review.find("span", {"itemprop": "reviewBody"}).text
loader.add_value('review_text', review_text)
author = review.find("span", {"itemprop": "author"}).text
loader.add_value('author', author)
yield loader.load_item()
else:
# Try Scrapy
review_list_xpath = '//div[#itemprop ="review"]'
selector = Selector(response)
for review in selector.xpath(review_list_xpath):
loader = ItemLoader(ReviewItem(), selector=review)
loader.add_xpath('review_text',
'.//span[#itemprop="reviewBody"]//text()')
loader.add_xpath('author',
'.//span[#itemprop="author"]//text()')
yield loader.load_item()
yield from self.paginate_reviews(response)
def paginate_reviews(self, response):
print('paginate_reviews')
print('================')
# Try Scrapy
selector = Selector(response)
search = '''//span[contains(#class,"item-next")]
//a[#class="next"]/#href
'''
next_reviews_link = selector.xpath(search).get()
# Try Beautiful Soup
if next_reviews_link is None:
soup = BeautifulSoup(response.text, "lxml")
try:
next_reviews_link = soup.find("a", {"class": "next"})['href']
except Exception as e:
pass
if next_reviews_link:
yield response.follow(next_reviews_link, self.parse_reviews)
It’s a common feature request for Parsel, Scrapy’s library for XML/HTML scraping.
However, you don’t need to wait for such a feature to be implemented. You can fix the HTML code using BeautifulSoup, and use Parsel on the fixed HTML:
from bs4 import BeautifulSoup
# …
response = response.replace(body=str(BeautifulSoup(response.body, "html5lib")))
You can get a charset error using the #Gallaecio's answer, if the original page was not utf-8 encoded, because the response has set to other encoding.
So, you must first switch the encoding.
In addition, there may be a problem of character escaping.
For example, if the character < is encountered in the text of html, then it must be escaped as <. Otherwise, "lxml" will delete it and the text near it, considering it an erroneous html tag.
"html5lib" escapes characters, but is slow.
response = response.replace(encoding='utf-8',
body=str(BeautifulSoup(response.body, 'html5lib')))
"html.parser" is faster, but from_encoding must also be specified (to example 'cp1251').
response = response.replace(encoding='utf-8',
body=str(BeautifulSoup(response.body, 'html.parser', from_encoding='cp1251')))

Xpath selector not working in Scrapy

I'm trying to extract the text from this Xpath:
//*/li[contains(., "Full Name")]/span/text()
from this webpage:
http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs
I've tested it in Google Chrome's Console (which works), as with many other variations of the Xpath, but I can't get it to work with Scrapy. My code only returns "{}".
Here's where I have been testing it in my code, for context:
def parse_bio(self, response):
loader = response.meta['loader']
fullnameValue = response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
loader.add_value('fullName', fullnameValue)
return loader.load_item()
The problem isn't my code (I don't think), it works fine with other (very broad) Xpath selectors. But I'm not sure what's wrong with the Xpath. I have JavaScript disabled, if that makes a difference.
Any help would be great!
Edit: Here is the rest of the code to make it more clear:
from scrapy import Spider, Request, Selector
from votesmart.items import LegislatorsItems, TheLoader
class VSSpider(Spider):
name = "vs"
allowed_domains = ["votesmart.org"]
start_urls = ["https://votesmart.org/officials/WA/L/washington-state-legislative"]
def parse(self, response):
for href in response.xpath('//h5/a/#href').extract():
person_url = response.urljoin(href)
yield Request(person_url, callback=self.candidatesPoliticalSummary)
def candidatesPoliticalSummary(self, response):
item = LegislatorsItems()
l = TheLoader(item=LegislatorsItems(), response=response)
...
#populating items with item loader. works fine
# create right bio url and pass item loader to it
bio_url = response.url.replace('votesmart.org/candidate/',
'votesmart.org/candidate/biography/')
return Request(bio_url, callback=self.parse_bio, meta={'loader': l})
def parse_bio(self, response):
loader = response.meta['loader']
print response.request.url
loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')
return loader.load_item()
I figured out my problem! Many pages on the site were login protected, and I wasn't able to scrape from pages that I couldn't access in the first place. Scrapy's form request did the trick. Thanks for all the help (especially the suggestion of using view(response), which is super helpful).
The expression is working for me in the shell perfectly as is:
$ scrapy shell "http://votesmart.org/candidate/biography/56110/norma-smith#.V9SwdZMrKRs"
In [1]: response.xpath('//*/li[contains(., "Full Name")]/span/text()').extract()
Out[1]: [u'Norma Smith']
Try using the add_xpath() method instead:
loader.add_xpath('fullName', '//*/li[contains(., "Full Name")]/span/text()')

Not able to scrape more then 10 records using scrapy

I'm new to scrapy and python. I'm using scrapy for scraping the data.
The site using AJAX for pagination so I'm not able to get the data more than 10 records I'm posting my code
from scrapy import Spider
from scrapy.selector import Selector
from scrapy import Request
from justdial.items import JustdialItem
import csv
from itertools import izip
import scrapy
import re
class JustdialSpider(Spider):
name = "JustdialSpider"
allowed_domains = ["justdial.com"]
start_urls = [
"http://www.justdial.com/Mumbai/Dentists/ct-385543",
]
def start_requests(self):
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
for url in self.start_urls:
yield Request(url, headers=headers)
def parse(self, response):
questions = Selector(response).xpath('//div[#class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]')
for question in questions:
item = JustdialItem()
item['name'] = question.xpath(
'//div[#class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()').extract()
item['contact'] = question.xpath(
'//div[#class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[#class="contact-info"]/span/a/b/text()').extract()
with open('some.csv', 'wb') as f:
writer = csv.writer(f)
writer.writerows(izip(item['name'], item['contact']))
f.close()
return item
# if running code above this I'm able to get 10 records of the page
# This code not working for getting data more than 10 records, Pagination using AJAX
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Mumbai&search=Chemical+Dealers&where=&catid=944&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate='
next_page = int(re.findall('page=(\d+)', url)[0]) + 1
next_url = re.sub('page=\d+', 'page={}'.format(next_page), url)
print next_url
def parse_ajaxurl(self, response):
# e.g. http://www.justdial.com/Mumbai/Dentists/ct-385543
my_headers = {'Referer': response.url}
yield Request("ajax_request_url",
headers=my_headers,
callback=self.parse_ajax)
Please help me
Thanks.
Actually if you disable javascript when viewing the page you'll notice that site offers traditional pagination instead of "never ending" AJAX one.
Using this you can simply find url of next page and continue:
def parse(self, response):
questions = response.xpath('//div[contains(#class,"store-details")]')
for question in questions:
item = dict()
item['name'] = question.xpath("h4/span/a/text()").extract_first()
item['contact'] = question.xpath("p[#class='contact-info']//b/text()").extract_first()
yield item
# next page
next_page = response.xpath("//a[#rel='next']/#href").extract_first()
if next_page:
yield Request(next_page)
I also fixed up your xpaths but in overal the only bit that changed is those 3 lines under # next page comment.
As a side note I've noticed you are saving to csv in spider where you can use built-in scrapy exporter command like:
scrapy crawl myspider --output results.csv

Resources