Scrapy: follow pagination driven by an AJAX POST request

I am quite new to Scrapy and have built a few spiders.
I am trying to scrape reviews from this page. My spider crawls the first page and scrapes those items, but when it comes to pagination it does not follow the links.
I know this happens because the pagination is an AJAX request, and a POST rather than a GET. I am a newbie with these, but I have read this. I have also read this post and followed the "mini-tutorial" to get the URL from the response, which seems to be
http://www.pcguia.pt/category/reviews/sorter=recent&location=&loop=main+loop&action=sort&view=grid&columns=3&paginated=2&currentquery%5Bcategory_name%5D=reviews
but when I try to open it in a browser it says
"Página não encontrada" = "PAGE NOT FOUND"
So far, am I thinking in the right direction? What am I missing?
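As a quick check (a sketch of my own, not part of the original question): the ajax.php endpoint answers POST form submissions, so pasting that query string into the address bar issues a GET and WordPress replies with its 404 page. Replaying the call as a POST, for example with the requests library, returns the JSON fragment that holds the rendered HTML:
import json
import requests  # any HTTP client that can send form-encoded POSTs will do

url = "http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php"
formdata = {
    "sorter": "recent",
    "location": "main loop",
    "loop": "main loop",
    "action": "sort",
    "view": "grid",
    "columns": "3",
    "paginated": "2",
    "currentquery[category_name]": "reviews",
}

resp = requests.post(url, data=formdata)
# the endpoint returns JSON; the rendered review cards live under the 'content' key
print(json.loads(resp.text).get("content", "")[:200])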
EDIT: my spider:
import scrapy
import json
from scrapy.http import FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem

class PcguiaSpider(scrapy.Spider):
    name = "pcguia"  # spider name to call in terminal
    allowed_domains = ['pcguia.pt']  # the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']  # url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        hxs = Selector(response)
        item_pub = ReviewItem()
        item_pub['date'] = hxs.xpath('//span[@class="date"]/text()').extract()  # format: year-month-dayThours:minutes:seconds-timezone, e.g. 2015-03-31T09:40:00-0700
        item_pub['title'] = hxs.xpath('//title/text()').extract()

        # pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

        yield item_pub
output:
2015-05-12 14:53:45+0100 [scrapy] INFO: Scrapy 0.24.5 started (bot: pcguia)
2015-05-12 14:53:45+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:53:45+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pcguia.spiders', 'SPIDER_MODULES': ['pcguia.spiders'], 'BOT_NAME': 'pcguia'}
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:53:45+0100 [scrapy] INFO: Enabled item pipelines:
2015-05-12 14:53:45+0100 [pcguia] INFO: Spider opened
2015-05-12 14:53:45+0100 [pcguia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6033
2015-05-12 14:53:45+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6090
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Crawled (200) <GET http://www.pcguia.pt/category/reviews/#paginated=1> (referer: None)
2015-05-12 14:53:45+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/category/reviews/>
{'date': '',
'title': [u'Reviews | PCGuia'],
}
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: http://www.pcguia.pt/category/reviews/)
2015-05-12 14:53:47+0100 [pcguia] DEBUG: Scraped from <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
{'date': '',
'title': ''
}

You can try this:
import json
from scrapy.http import FormRequest
from scrapy.selector import Selector
# other imports

class SpiderClass(Spider):
    # spider name and all
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            # the AJAX responses are JSON; the rendered HTML lives under 'content'
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return
I have tested this in the Scrapy shell and it works.
In the Scrapy shell:
In [0]: response.url
Out[0]: 'http://www.pcguia.pt/category/reviews/#paginated=1'
In [1]: from scrapy.http import FormRequest
In [2]: from scrapy.selector import Selector
In [3]: import json
In [4]: response.xpath('//h2/a/text()').extract()
Out[4]:
[u'HP Slate 8 Plus',
u'Astro A40 +MixAmp Pro',
u'Asus ROG G751J',
u'BQ Aquaris E5 HD 4G',
u'Asus GeForce GTX980 Strix',
u'AlienTech BattleBox Edition',
u'Toshiba Encore Mini WT7-C',
u'Samsung Galaxy Note 4',
u'Asus N551JK',
u'Western Digital My Passport Wireless',
u'Nokia Lumia 735',
u'Photoshop Elements 13',
u'AMD Radeon R9 285',
u'Asus GeForce GTX970 Stryx',
u'TP-Link AC750 Wifi Repeater']
In [5]: url = "http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php"
In [6]: formdata = {
'sorter':'recent',
'location':'main loop',
'loop':'main loop',
'action':'sort',
'view':'grid',
'columns':'3',
'paginated':'2',
'currentquery[category_name]':'reviews'
}
In [7]: r = FormRequest(url=url, formdata=formdata)
In [8]: fetch(r)
2015-05-12 18:29:16+0530 [default] DEBUG: Crawled (200) <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7fcc247c4590>
[s] item {}
[s] r <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
[s] request <POST http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
[s] response <200 http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php>
[s] settings <scrapy.settings.Settings object at 0x7fcc2a74f450>
[s] spider <Spider 'default' at 0x7fcc239ba990>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [9]: json_data = json.loads(response.body)
In [10]: sell = Selector(text=json_data.get('content', ''))
In [11]: sell.xpath('//h2/a/text()').extract()
Out[11]:
[u'Asus ROG GR8',
u'Devolo dLAN 1200+',
u'Yezz Billy 4,7',
u'Sony Alpha QX1',
u'Toshiba Encore2 WT10',
u'BQ Aquaris E5 FullHD',
u'Toshiba Canvio AeroMobile',
u'Samsung Galaxy Tab S 10.5',
u'Modecom FreeTab 7001 HD',
u'Steganos Online Shield VPN',
u'AOC G2460PG G-Sync',
u'AMD Radeon R7 SSD',
u'Nvidia Shield',
u'Asus ROG PG278Q GSync',
u'NOX Krom Kombat']
EDIT
import scrapy
import json
from scrapy.http import FormRequest, Request
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from pcguia.items import ReviewItem
from dateutil import parser
import re

class PcguiaSpider(scrapy.Spider):
    name = "pcguia"  # spider name to call in terminal
    allowed_domains = ['pcguia.pt']  # the domain where the spider is allowed to crawl
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']  # url from which the spider will start crawling
    page_incr = 1
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        review_links = sel.xpath('//h2/a/@href').extract()
        for link in review_links:
            yield Request(url=link, callback=self.parse_review)

        # pagination code starts here
        # if page has content
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return

    def parse_review(self, response):
        # Portuguese month names -> English, so dateutil can parse the date
        month_matcher = 'novembro|janeiro|agosto|mar\xe7o|fevereiro|junho|dezembro|julho|abril|maio|outubro|setembro'
        month_dict = {u'abril': u'April',
                      u'agosto': u'August',
                      u'dezembro': u'December',
                      u'fevereiro': u'February',
                      u'janeiro': u'January',
                      u'julho': u'July',
                      u'junho': u'June',
                      u'maio': u'May',
                      u'mar\xe7o': u'March',
                      u'novembro': u'November',
                      u'outubro': u'October',
                      u'setembro': u'September'}
        review_date = response.xpath('//span[@class="date"]/text()').extract()
        review_date = review_date[0].strip().strip('Publicado a').lower() if review_date else ''
        month = re.findall('%s' % month_matcher, review_date)[0]
        _date = parser.parse(review_date.replace(month, month_dict.get(month))).strftime('%Y-%m-%dT%H:%M:%S')
        title = response.xpath('//h1[@itemprop="itemReviewed"]/text()').extract()
        title = title[0].strip() if title else ''
        item_pub = ReviewItem(
            date=_date,
            title=title)
        yield item_pub
output
{'date': '2014-11-05T00:00:00', 'title': u'Samsung Galaxy Tab S 10.5'}
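As a standalone illustration of the date handling above (my own sketch; the input string is invented to mirror the site's "Publicado a ..." format):
import re
from dateutil import parser

# trimmed month mapping for the demo; the spider above maps all twelve months
month_dict = {u'novembro': u'November'}
review_date = u'Publicado a 5 novembro 2014'.strip().strip('Publicado a').lower()
month = re.findall('|'.join(month_dict), review_date)[0]
print(parser.parse(review_date.replace(month, month_dict[month])).strftime('%Y-%m-%dT%H:%M:%S'))
# -> 2014-11-05T00:00:00, matching the item shown above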

The proper solution for this would be to use Selenium.
The problem you are facing is that the updated page source never reaches your Scrapy spider.
Selenium can click the subsequent links and pass the updated page source to your response.xpath calls.
I can provide more help if you share the Scrapy code you are using.
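For example, a minimal sketch of that approach (my own illustration; it assumes Chrome with chromedriver on PATH, and the CSS selector for the pagination control is a placeholder you would need to adjust):
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.selector import Selector
import time

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get("http://www.pcguia.pt/category/reviews/")

for _ in range(3):  # walk a few pages as a demo
    sel = Selector(text=driver.page_source)  # hand the rendered HTML to Scrapy's selector
    for title in sel.xpath('//h2/a/text()').extract():
        print(title)
    next_btn = driver.find_element(By.CSS_SELECTOR, "a.next")  # placeholder selector
    next_btn.click()
    time.sleep(2)  # crude wait for the AJAX content; WebDriverWait would be cleaner

driver.quit()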

Related

Add user to existing user group in NiFi

Nipyapi version: 0.14.0
NiFi version: 1.11
NiFi-Registry version:
Python version:3.6
Operating System: Linux
Description
I want to add a user (already existing or just created) to an existing user group via the API.
What I Did
import nipyapi
import urllib3
from UserManagement import add_user
from nipyapi import config, canvas
from nipyapi import security

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

status = None
# admin_email = args["email"]
# zone = args["zone"]
nifi = "https://sdrginnifi0101:8081/nifi-api"
config.nifi_config.host = nifi
config.nifi_config.verify_ssl = False
client_cer = "/opt/application/sdr/apps/nifi/conf/client.cer"
client_key = "/opt/application/sdr/apps/nifi/conf/client.key"
security.set_service_ssl_context(service="nifi", client_cert_file=client_cer, client_key_file=client_key)

user_list = []
root_id = canvas.get_root_pg_id()  # id of the root canvas of NiFi
pg = canvas.get_process_group(root_id, "id")  # ProcessGroupEntity
email_list = ["hamza.bekkouri@gmail.com"]
ug = security.get_service_user_group("p10092", identifier_type='identity', service='nifi')

if len(email_list) == 1:
    add_user(email_list[0])
    user = security.get_service_user(email_list[0], identifier_type="identity", service="nifi")
    userGroupDto = nipyapi.nifi.models.user_group_dto.UserGroupDTO(users=[user], access_policies=ug.component.access_policies)
    userGroupEntity = nipyapi.nifi.models.user_group_entity.UserGroupEntity(component=userGroupDto)
else:
    for mail in email_list:
        add_user(mail)
        user = security.get_service_user(mail, identifier_type="identity", service="nifi")
        user_list.append(user)
    userGroupDto = nipyapi.nifi.models.user_group_dto.UserGroupDTO(access_policies=ug.component.access_policies, users=user_list)
    userGroupEntity = nipyapi.nifi.models.user_group_entity.UserGroupEntity(component=userGroupDto)

TenantApi = nipyapi.nifi.apis.tenants_api.TenantsApi(api_client=None)
TenantApi.update_user_group(ug.id, userGroupEntity)
ERROR
INFO --user (hamza.bekkouri.ext@orange.com) already exist
Traceback (most recent call last):
File "add_user_to_project.py", line 56, in <module>
TenantApi.update_user_group(ug.id,userGroupEntity)
File "/home/nifi/.local/lib/python2.7/site-packages/nipyapi/nifi/apis/tenants_api.py", line 1142, in update_user_group
(data) = self.update_user_group_with_http_info(id, body, **kwargs)
File "/home/nifi/.local/lib/python2.7/site-packages/nipyapi/nifi/apis/tenants_api.py", line 1229, in update_user_group_with_http_info
collection_formats=collection_formats)
File "/home/nifi/.local/lib/python2.7/site-packages/nipyapi/nifi/api_client.py", line 326, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/home/nifi/.local/lib/python2.7/site-packages/nipyapi/nifi/api_client.py", line 153, in __call_api
_request_timeout=_request_timeout)
File "/home/nifi/.local/lib/python2.7/site-packages/nipyapi/nifi/api_client.py", line 379, in request
body=body)
File "/home/nifi/.local/lib/python2.7/site-packages/nipyapi/nifi/rest.py", line 278, in PUT
body=body)
File "/home/nifi/.local/lib/python2.7/site-packages/nipyapi/nifi/rest.py", line 224, in request
raise ApiException(http_resp=r)
nipyapi.nifi.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Content-Length': '614', 'X-XSS-Protection': '1; mode=block', 'Content-Security-Policy': "frame-ancestors 'self'", 'Strict-Transport-Security': 'max-age=31540000', 'Vary': 'Accept-Encoding', 'Server': 'Jetty(9.4.19.v20190610)', 'Date': 'Sat, 06 Jun 2020 22:53:14 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Type': 'text/plain'})
HTTP response body: Unrecognized field "accessPolicies" (class org.apache.nifi.web.api.dto.TenantDTO), not marked as ignorable (6 known properties: "parentGroupId", "versionedComponentId", "position", "id", "identity", "configurable"])
at [Source: (org.glassfish.jersey.message.internal.ReaderInterceptorExecutor$UnCloseableInputStream); line: 1, column: 77] (through reference chain: org.apache.nifi.web.api.entity.UserGroupEntity["component"]->org.apache.nifi.web.api.dto.UserGroupDTO["users"]->java.util.HashSet[0]->org.apache.nifi.web.api.entity.TenantEntity["component"]->org.apache.nifi.web.api.dto.TenantDTO["accessPolicies"])
I can't find a way to add a user to an existing user group.
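One direction worth trying (a sketch only, based on reading the error message rather than a verified nipyapi recipe): the 400 says accessPolicies is not a recognized field of the nested TenantDTO, which suggests NiFi rejects full UserEntity objects inside UserGroupDTO.users and instead expects lightweight tenant references, plus the group's current revision on the update. Something along these lines (the TenantEntity/TenantDTO usage and the revision handling are my assumptions):
import nipyapi
from nipyapi import security

def add_users_to_group(group_identity, emails):
    # fetch the existing group and keep the members it already has
    ug = security.get_service_user_group(group_identity, identifier_type='identity', service='nifi')
    members = list(ug.component.users or [])
    for mail in emails:
        user = security.get_service_user(mail, identifier_type='identity', service='nifi')
        # reference the user by id/identity only, not the full UserEntity
        members.append(
            nipyapi.nifi.models.tenant_entity.TenantEntity(
                id=user.id,
                component=nipyapi.nifi.models.tenant_dto.TenantDTO(
                    id=user.id, identity=user.component.identity)))
    dto = nipyapi.nifi.models.user_group_dto.UserGroupDTO(
        id=ug.id, identity=ug.component.identity, users=members)
    entity = nipyapi.nifi.models.user_group_entity.UserGroupEntity(
        revision=ug.revision,  # NiFi expects the current revision on updates
        id=ug.id,
        component=dto)
    return nipyapi.nifi.apis.tenants_api.TenantsApi().update_user_group(ug.id, entity)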

Scrapy return nothing when crawling AWS blogs site

Here's my attempt to crawl the list of URLs on the first page of the AWS blogs site.
But it returns nothing. I think there may be something wrong with my XPath, but I'm not sure how to fix it.
import scrapy

class AwsblogSpider(scrapy.Spider):
    name = 'awsblog'
    allowed_domains = ['aws.amazon.com/blogs']
    start_urls = ['http://aws.amazon.com/blogs/']

    def parse(self, response):
        blogs = response.xpath('//li[@class="m-card"]')
        for blog in blogs:
            url = blog.xpath('.//div[@class="m-card-title"]/a/@href').extract()
            print(url)
Attempt 2
import scrapy

class AwsblogSpider(scrapy.Spider):
    name = 'awsblog'
    allowed_domains = ['aws.amazon.com/blogs']
    start_urls = ['http://aws.amazon.com/blogs/']

    def parse(self, response):
        blogs = response.xpath('//div[@class="aws-directories-container"]')
        for blog in blogs:
            url = blog.xpath('//li[@class="m-card"]/div[@class="m-card-title"]/a/@href').extract_first()
            print(url)
Log output:
2019-11-06 10:38:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-11-06 10:38:30 [scrapy.core.engine] INFO: Spider opened
2019-11-06 10:38:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-11-06 10:38:30 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-11-06 10:38:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://aws.amazon.com/robots.txt> from <GET http://aws.amazon.com/robots.txt>
2019-11-06 10:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aws.amazon.com/robots.txt> (referer: None)
2019-11-06 10:38:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://aws.amazon.com/blogs/> from <GET http://aws.amazon.com/blogs/>
2019-11-06 10:38:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aws.amazon.com/blogs/> (referer: None)
2019-11-06 10:38:32 [scrapy.core.engine] INFO: Closing spider (finished)
Any help would be greatly appreciated!
You are using the wrong approach: the site loads the blog details via a dynamic script, so the cards are not present in the initial HTML. Check the page source to see which content is actually available.
To fetch the data, you should use a dynamic-rendering technique such as the ones below (see the sketch after this list):
1. Scrapy Splash
2. Selenium
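For instance, a minimal scrapy-splash sketch (my own illustration; it assumes a Splash instance is running on localhost:8050 and that the scrapy-splash downloader/spider middlewares from its README are enabled in settings.py):
import scrapy
from scrapy_splash import SplashRequest

class AwsblogSplashSpider(scrapy.Spider):
    name = "awsblog_splash"
    allowed_domains = ["aws.amazon.com"]  # a bare domain, not a URL path
    custom_settings = {"SPLASH_URL": "http://localhost:8050"}  # plus the middlewares from the README

    def start_requests(self):
        # Splash renders the JavaScript that builds the blog cards before returning the HTML
        yield SplashRequest("https://aws.amazon.com/blogs/", callback=self.parse, args={"wait": 2})

    def parse(self, response):
        for href in response.xpath('//li[@class="m-card"]//div[@class="m-card-title"]/a/@href').extract():
            yield {"url": href}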

Scrapy view redirect to other page and get <400> error

I am trying to do a scrapy view or fetch on https://www.watsons.com.sg, but the page gets redirected and returns a <400> error. I wonder if there is any way to work around it. The log shows something like this:
2018-11-15 22:54:15 [scrapy.core.engine] INFO: Spider opened
2018-11-15 22:54:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-15 22:54:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-15 22:54:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F> from <GET https://www.watsons.com.sg>
2018-11-15 22:54:16 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F> (referer: None)
2018-11-15 22:54:17 [scrapy.core.engine] INFO: Closing spider (finished)
If I use requests.get("https://www.watsons.com.sg") it works fine. Any idea or comment is much appreciated. Thanks.
Okay, so this is one of the weird behaviors of scrapy.
If you look at the location header in the HTTP response (with Firefox developer tools for example), you can see:
location: https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F
Note that there is no / between the .com.sg and the ?.
Looking at how Firefox behaves, on the next request it adds the missing /:
However, somehow scrapy does not do it!
If you look at your logs, when the HTTP 400 error is received, we can see that the / is missing.
This is being discussed in this issue: https://github.com/scrapy/scrapy/issues/1133
For now, the way I work around it is to have my own downloader middleware that normalizes the location header before the response is passed to the redirect middleware.
It looks like this:
from scrapy.spiders import Spider
from w3lib.url import safe_download_url

class MySpider(Spider):
    name = 'watsons.com.sg'
    start_urls = ['https://www.watsons.com.sg/']
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'spiders.myspider.FixLocationHeaderMiddleWare': 650
        }
    }

    def parse(self, response):
        pass

class FixLocationHeaderMiddleWare:
    def process_response(self, request, response, spider):
        if 'location' in response.headers:
            response.headers['location'] = safe_download_url(response.headers['location'].decode())
        return response

Scrapy xpath fail to find certain div in a webpage

I used Scrapy shell to load this webpage:
scrapy shell "http://goo.gl/VMNMuK"
and want to find:
response.xpath("//div[@class='inline']")
However, it returns []. If I use find in Chrome's inspector on this webpage, I can find 3 occurrences of "//div[@class='inline']". Is this a bug?
This page's inline divs are after </body></html>...
</body></html>
<script type="text/javascript">
var cpro_id="u2312677";
...
Here are some things to try:
# keep only the markup that comes after the closing </html> tag and re-parse it
rest = response.body[response.body.find('</html>') + 8:]
from scrapy.selector import Selector
Selector(text=rest).xpath("//div[@class='inline']")
You can also use html5lib for parsing the response body, and work on an lxml document using lxml.html.html5parser for example. In the example scrapy shell session below, I had to use namespaces to work with XPath:
$ scrapy shell http://chuansong.me/n/2584954
2016-03-07 12:06:42 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-07 12:06:44 [scrapy] DEBUG: Crawled (200) <GET http://chuansong.me/n/2584954> (referer: None)
In [1]: response.xpath('//div[@class="inline"]')
Out[1]: []
In [2]: response.xpath('//*[@class="inline"]')
Out[2]: []
In [3]: response.xpath('//html')
Out[3]: [<Selector xpath='//html' data=u'<html lang="zh-CN">\n<head>\n<meta http-eq'>]
In [4]: from lxml.html import tostring, html5parser
In [5]: dochtml5 = html5parser.document_fromstring(response.body_as_unicode())
In [6]: type(dochtml5)
Out[6]: lxml.etree._Element
In [7]: dochtml5.xpath('//div[@class="inline"]')
Out[7]: []
In [8]: dochtml5.xpath('//html:div[@class="inline"]', namespaces={"html": "http://www.w3.org/1999/xhtml"})
Out[8]:
[<Element {http://www.w3.org/1999/xhtml}div at 0x7f858cfe3998>,
<Element {http://www.w3.org/1999/xhtml}div at 0x7f858cf691b8>,
<Element {http://www.w3.org/1999/xhtml}div at 0x7f858cf73680>]
In [9]: for div in dochtml5.xpath('//html:div[@class="inline"]', namespaces={"html": "http://www.w3.org/1999/xhtml"}):
print tostring(div)
....:
<html:div xmlns:html="http://www.w3.org/1999/xhtml" class="inline">
<html:span>新浪名博、畅销书作家王珣的原创自媒体,“芙蓉树下”的又一片新天地,愿你美丽优雅地走过全世界。</html:span>
</html:div>
<html:div xmlns:html="http://www.w3.org/1999/xhtml" class="inline">
<html:img src="http://q.chuansong.me/beauties-4.jpg" alt="美人的底气 微信二维码" height="210px" width="210px"></html:img>
</html:div>
<html:div xmlns:html="http://www.w3.org/1999/xhtml" class="inline">
<html:script src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js" async=""></html:script>
<html:ins style="display:inline-block;width:210px;height:210px" data-ad-client="ca-pub-0996811467255783" class="adsbygoogle" data-ad-slot="2990020277"></html:ins>
<html:script>(adsbygoogle = window.adsbygoogle || []).push({});</html:script>
</html:div>

Using beautiful soup to clean up scraped HTML from scrapy

I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example the following link: http://scholar.google.com/scholar?q=intitle%3Apython+xpath
Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:
scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
which gives me the scrapy shell, inside which I do:
>>> sel.xpath('//h3[@class="gs_rt"]/a').extract()
[
u'<b>Python </b>Paradigms for XML',
u'NCClient: A <b>Python </b>Library for NETCONF Clients',
u'PALSE: <b>Python </b>Analysis of Large Scale (Computer) Experiments',
u'<b>Python </b>and XML',
u'drx: <b>Python </b>Programming Language [Computers: Programming: Languages: <b>Python</b>]-loadaverageZero',
u'XML and <b>Python </b>Tutorial',
u'Zato\u2014agile ESB, SOA, REST and cloud integrations in <b>Python</b>',
u'XML Processing with Perl, <b>Python</b>, and PHP',
u'<b>Python </b>& XML',
u'A <b>Python </b>Module for NETCONF Clients'
]
As you can see, this output is raw HTML that needs cleaning. I now have a good sense of how to clean this HTML up. The simplest way is probably to just use BeautifulSoup and try something like:
t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
soup = BeautifulSoup(t)
text_parts = soup.findAll(text=True)
text = ''.join(text_parts)
This is based on an earlier SO question. A regex version has been suggested, but I am guessing that BeautifulSoup will be more robust.
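For reference, a minimal sketch of that cleanup (my own, not from the linked post): BeautifulSoup expects a single markup string, so it is safer to clean each extracted fragment on its own rather than pass the whole list in, which may be part of why the spider version below didn't work:
from bs4 import BeautifulSoup

# fragments as returned by sel.xpath('//h3[@class="gs_rt"]/a').extract()
fragments = [
    u'<b>Python </b>Paradigms for XML',
    u'NCClient: A <b>Python </b>Library for NETCONF Clients',
]
titles = [BeautifulSoup(f, "html.parser").get_text() for f in fragments]
# ['Python Paradigms for XML', 'NCClient: A Python Library for NETCONF Clients']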
I'm a scrapy n00b and can't figure out how to embed this in my spider. I tried
from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup
from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholar"
    allowed_domains = ["scholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()
        t = sel.xpath('//h3[@class="gs_rt"]/a').extract()
        soup = BeautifulSoup(t)
        text_parts = soup.findAll(text=True)
        text = ''.join(text_parts)
        item['title'] = text
        return(item)
But that didn't quite work. Any suggestions would be helpful.
Edit 3: Based on suggestions, I have modified my spider file to:
from scrapy.spider import Spider
from scrapy.selector import Selector
from bs4 import BeautifulSoup
from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "dmoz"
    allowed_domains = ["sholar.google.com"]
    start_urls = [
        "http://scholar.google.com/scholar?q=intitle%3Anine+facts+about+top+journals+in+economics"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = ScholarscrapeItem()
        titles = sel.xpath('//h3[@class="gs_rt"]/a')
        for title in titles:
            title = item.xpath('.//text()').extract()
            print "".join(title)
However, I get the following output:
2014-02-17 15:11:12-0800 [scrapy] INFO: Scrapy 0.22.2 started (bot: scholarscrape)
2014-02-17 15:11:12-0800 [scrapy] INFO: Optional features available: ssl, http11
2014-02-17 15:11:12-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scholarscrape.spiders', 'SPIDER_MODULES': ['scholarscrape.spiders'], 'BOT_NAME': 'scholarscrape'}
2014-02-17 15:11:12-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-17 15:11:13-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider opened
2014-02-17 15:11:13-0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-17 15:11:13-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-17 15:11:13-0800 [dmoz] DEBUG: Crawled (200) <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml> (referer: None)
2014-02-17 15:11:13-0800 [dmoz] ERROR: Spider error processing <GET http://scholar.google.com/scholar?q=intitle%3Apython+xml>
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 1178, in mainLoop
self.runUntilCurrent()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 368, in callback
self._startRunCallbacks(result)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 464, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 551, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/krishnan/work/research/journals/code/scholarscrape/scholarscrape/spiders/scholar_spider.py", line 20, in parse
title = item.xpath('.//text()').extract()
File "/Library/Python/2.7/site-packages/scrapy/item.py", line 65, in __getattr__
raise AttributeError(name)
exceptions.AttributeError: xpath
2014-02-17 15:11:13-0800 [dmoz] INFO: Closing spider (finished)
2014-02-17 15:11:13-0800 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 247,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 108851,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 196648),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/AttributeError': 1,
'start_time': datetime.datetime(2014, 2, 17, 23, 11, 13, 21701)}
2014-02-17 15:11:13-0800 [dmoz] INFO: Spider closed (finished)
Edit 2: My original question was quite different, but I am now convinced that this is the right way to proceed. Original question (and first edit below):
I'm using scrapy to try and scrape some data that I need off Google Scholar. Consider, as an example the following link:
http://scholar.google.com/scholar?q=intitle%3Apython+xpath
Now, I'd like to scrape all the titles off this page. The process that I am following is as follows:
scrapy shell "http://scholar.google.com/scholar?q=intitle%3Apython+xpath"
which gives me the scrapy shell, inside which I do:
>>> sel.xpath('string(//h3[@class="gs_rt"]/a)').extract()
[u'Python Paradigms for XML']
As you can see, this only selects the first title, and none of the others on the page. I can't figure out what I should modify my XPath to, so that I select all such elements on the page. Any help is greatly appreciated.
Edit 1: My first approach was to try
>>> sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
[u'Paradigms for XML', u'NCClient: A ', u'Library for NETCONF Clients',
u'PALSE: ', u'Analysis of Large Scale (Computer) Experiments', u'and XML',
u'drx: ', u'Programming Language [Computers: Programming: Languages: ',
u']-loadaverageZero', u'XML and ', u'Tutorial',
u'Zato\u2014agile ESB, SOA, REST and cloud integrations in ',
u'XML Processing with Perl, ', u', and PHP', u'& XML', u'A ',
u'Module for NETCONF Clients']
The problem with this approach is that if you look at the actual Google Scholar page, you will see that the first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as Scrapy returns. My guess is that 'Python' is wrapped inside <b> tags, which is why text() is not doing what we want it to do.
This is a really interesting and rather difficult question. The problem you're facing is that "Python" in the title is in bold, so it is treated as a node, while the rest of the title is plain text; therefore text() extracts only the textual content and not the content of the <b> node.
Here's my solution. First get all the links:
titles = sel.xpath('//h3[@class="gs_rt"]/a')
then iterate over them and select all the textual content of each node; in other words, join the <b> node with the text nodes for each child of the link:
for item in titles:
    title = item.xpath('.//text()').extract()
    print "".join(title)
This works because in the for loop you deal with the textual content of the children of each link, so you can join the matching pieces. In the loop, title will be, for instance, [u'Python ', u'Paradigms for XML'] or [u'NCClient: A ', u'Python ', u'Library for NETCONF Clients'].
The XPath string() function only returns the string representation of the first node you pass to it.
Just extract nodes normally, don't use string().
sel.xpath('//h3[@class="gs_rt"]/a').extract()
or
sel.xpath('//h3[@class="gs_rt"]/a/text()').extract()
The first entry is actually 'Python Paradigms for XML' and not 'Paradigms for XML' as Scrapy returns.
You need to use normalize-space(), which returns the string value of the whole element (all descendant text, including the <b> parts) with whitespace normalized, whereas text() only returns the element's direct text nodes. So your initial XPath would look like this:
sel.xpath('//h3[@class="gs_rt"]/a').xpath("normalize-space()").extract()
Example:
# HTML has been simplified
from parsel import Selector

html = '''
<span class=gs_ctg2>[PDF]</span> erdincuzun.com
<div class="gs_ri">
<h3 class="gs_rt">
<span class="gs_ctc"><span class="gs_ct1">[PDF]</span><span class="gs_ct2">[PDF]</span></span>
<a href="https://erdincuzun.com/wp-content/uploads/download/plovdiv_2018_01.pdf">Comparison of <b>Python </b>libraries used for Web
data extraction</a>
'''

selector = Selector(text=html)

# get() to get textual data
print("Without normalize-space:\n", selector.xpath('//*[@class="gs_rt"]/a/text()').get())
print("\nWith normalize-space:\n", selector.xpath('//*[@class="gs_rt"]/a').xpath("normalize-space()").get())
"""
Without normalize-space:
Comparison
of
With normalize-space:
Comparison of Python libraries used for Web data extraction
"""
Actual code and example in the online IDE to get titles from Google Scholar Organic results:
import scrapy

class ScholarSpider(scrapy.Spider):
    name = "scholar_titles"
    allowed_domains = ["scholar.google.com"]
    start_urls = ["https://scholar.google.com/scholar?q=intitle%3Apython+xpath"]

    def parse(self, response):
        for quote in response.xpath('//*[@class="gs_rt"]/a'):
            yield {
                "title": quote.xpath("normalize-space()").get()
            }
Run it:
$ scrapy runspider -O <file_name>.jl <file_name>.py
-O overwrites the output file, while -o appends to it.
jl is the JSON Lines file format.
Output with normalize-space():
{"title": "Comparison of Python libraries used for Web data extraction"}
{"title": "Approaching the largest 'API': extracting information from the internet with python"}
{"title": "News crawling based on Python crawler"}
{"title": "A survey on python libraries used for social media content scraping"}
{"title": "Design and Implementation of Crawler Program Based on Python"}
{"title": "DECEPTIVE SECURITY USING PYTHON"}
{"title": "Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others"}
{"title": "Python Paradigms for XML"}
{"title": "Using Web Scraping In A Knowledge Environment To Build Ontologies Using Python And Scrapy"}
{"title": "XML processing with Python"}
Output without normalize-space():
{"title": "Comparison of "}
{"title": "libraries used for Web data extraction"}
{"title": "Approaching the largest 'API': extracting information from the internet with "}
{"title": "News crawling based on "}
{"title": "crawler"}
{"title": "A survey on "}
{"title": "libraries used for social media content scraping"}
{"title": "Design and Implementation of Crawler Program Based on "}
{"title": "DECEPTIVE SECURITY USING "}
{"title": "Hands-On Web Scraping with "}
{"title": ": Perform advanced scraping operations using various "}
{"title": "libraries and tools such as Selenium, Regex, and others"}
{"title": "Paradigms for XML"}
{"title": "Using Web Scraping In A Knowledge Environment To Build Ontologies Using "}
{"title": "And Scrapy"}
{"title": "XML processing with "}
Alternatively, you can achieve it with Google Scholar Organic Results API from SerpApi.
It's a paid API with a free plan. You don't have to figure out the extraction, maintain it, scale it, or work out how to bypass blocks from search engines, since that is already done for the end user.
Example code to integrate:
from serpapi import GoogleScholarSearch

params = {
    "api_key": "Your SerpApi API key",  # API key
    "engine": "google_scholar",         # parsing engine
    "q": "intitle:python XPath",        # search query
    "hl": "en"                          # language
}

search = GoogleScholarSearch(params)  # where extraction happens on the SerpApi back-end
results = search.get_dict()           # JSON -> Python dictionary

for result in results["organic_results"]:
    title = result["title"]
    print(title)
Output:
Comparison of Python libraries used for Web data extraction
Approaching the largest 'API': extracting information from the internet with python
News crawling based on Python crawler
A survey on python libraries used for social media content scraping
Design and Implementation of Crawler Program Based on Python
DECEPTIVE SECURITY USING PYTHON
Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
Python Paradigms for XML
Using Web Scraping In A Knowledge Environment To Build Ontologies Using Python And Scrapy
XML processing with Python
Disclaimer, I work for SerpApi.
