I have asked a similar question before: Scrapy can't get data. But I ran into a new problem when using another spider. I've paid attention to the XPath, but it seems the same error shows up in this program.
Here is my spider's code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from DB_Connection import DB_Con
class UniParc(Item):
    database = Field()
    identifier = Field()
    version = Field()
    organism = Field()
    first_seen = Field()
    last_seen = Field()
    active = Field()
    source = Field()
class UniParcSpider(CrawlSpider):
    name = "UniParc"
    allowed_domains = ["uniprot.org"]
    start_urls = ["http://www.uniprot.org/uniparc/?query=rna&offset=25&sort=score&columns=id%2corganisms%2ckb%2cfirst-seen%2clast-seen%2clength"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="results"]/tr/td[2]/a',)), callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//*[@id="results"]/tr')
        db = DB_Con()
        collection = db.getcollection(self.term)

        for site in sites:
            item = UniParc()
            item["database"] = map(unicode.strip, site.xpath("td[1]/text()").extract())
            item["identifier"] = map(unicode.strip, site.xpath("td[2]/a/text()").extract())
            item["version"] = map(unicode.strip, site.xpath("td[3]/text()").extract())
            item["organism"] = map(unicode.strip, site.xpath("td[4]/a/text()").extract())
            item["first_seen"] = map(unicode.strip, site.xpath("td[5]/text()").extract())
            item["last_seen"] = map(unicode.strip, site.xpath("td[6]/text()").extract())
            item["active"] = map(unicode.strip, site.xpath("td[7]/text()").extract())
            item['source'] = self.name

            collection.update({"identifier": item['identifier']}, dict(item), upsert=True)
            yield item
I used rules to extract the links I want to follow and get data from, but it seems no URLs were extracted from the start_url at all.
Here is the log:
2016-05-28 22:28:54 [scrapy] INFO: Enabled item pipelines:
2016-05-28 22:28:54 [scrapy] INFO: Spider opened
2016-05-28 22:28:54 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-28 22:28:54 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-28 22:28:55 [scrapy] DEBUG: Crawled (200) <GET http://www.uniprot.org/uniparc/?query=rna&offset=25&sort=score&columns=id%2corganisms%2ckb%2cfirst-seen%2clast-seen%2clength> (referer: None)
2016-05-28 22:28:55 [scrapy] INFO: Closing spider (finished)
2016-05-28 22:28:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 314,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 12263,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 28, 14, 28, 55, 638618),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 5, 28, 14, 28, 54, 645490)}
So can anybody tell me what's wrong with my code? Is there something wrong with my XPath? I've checked it so many times.
To fix the link-following step, you just need to correct the XPath expression. Replace:
//*[@id="results"]/tr/td[2]/a
with:
//*[@id="results"]//tr/td[2]/a
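For reference, here is a sketch of how that corrected expression would sit in the spider, with everything else kept as in the question:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class UniParcSpider(CrawlSpider):
    name = "UniParc"
    allowed_domains = ["uniprot.org"]
    start_urls = ["http://www.uniprot.org/uniparc/?query=rna&offset=25&sort=score&columns=id%2corganisms%2ckb%2cfirst-seen%2clast-seen%2clength"]

    # The only change is /tr -> //tr, so the row links are matched even though
    # they are not direct children of the results element.
    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="results"]//tr/td[2]/a',)),
             callback="parse_items", follow=True),
    )

    # parse_items stays exactly as in the question.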
And, as a side note, you should not be inserting the extracted items into the database directly in the spider. Scrapy offers item pipelines for exactly that. In the case of MongoDB, check out scrapy-mongodb. Also see:
Web Scraping With Scrapy and MongoDB
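To illustrate the pipeline idea, here is a minimal sketch of an upsert pipeline written directly against pymongo. This is not scrapy-mongodb itself; the MONGO_URI and MONGO_DB setting names, their defaults, and the choice to use the spider name as the collection name are assumptions for the example.

import pymongo


class MongoUpsertPipeline(object):
    """Sketch only: upserts each item into MongoDB keyed on its 'identifier'."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI / MONGO_DB are hypothetical settings names for this sketch.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DB', 'uniparc'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Mirrors the collection.update() call that currently lives in the spider,
        # but keeps the database code out of parse_items.
        self.db[spider.name].update(
            {'identifier': item['identifier']}, dict(item), upsert=True)
        return item

It would then be enabled through the ITEM_PIPELINES setting, and the DB_Con/collection code could be dropped from parse_items so the spider only yields items.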
I have a lot of Amplify apps which I want to manage via Lambdas. What is the equivalent of the CLI command aws amplify list-apps in boto3? I had multiple attempts, but none of them worked out for me.
My bit of code that was using nextToken looked like this:
amplify = boto3.client('amplify')

apps = amplify.list_apps()
print(apps)
print('First token is: ', apps['nextToken'])

while 'nextToken' in apps:
    apps = amplify.list_apps(nextToken=apps['nextToken'])
    print('=====NEW APP=====')
    print(apps)
    print('=================')
Then I tried to use paginators like:
paginator = amplify.get_paginator('list_apps')
response_iterator = paginator.paginate(
    PaginationConfig={
        'MaxItems': 100,
        'PageSize': 100
    }
)
for i in response_iterator:
    print(i)
Both attempts produced inconsistent output. The first one printed the first token and the second entry, but nothing more. The second one gave only the first entry.
Edit with more attempt info and output. Below is the piece of code:
apps = amplify.list_apps()
print(apps)
print('---------------')
new_app = amplify.list_apps(nextToken=apps['nextToken'], maxResults=100)
print(new_app)
print('---------------')
Returns (some sensitive output bits were removed):
EVG_long_token_x4gbDGaAWGPGOASRtJPSI='}
---------------
{'ResponseMetadata': {'RequestId': 'f6...e9eb', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/json', 'content-length': ...}, 'RetryAttempts': 0}, 'apps': [{'appId': 'dym7444jed2kq', 'appArn': 'arn:aws:amplify:us-east-2:763175725735:apps/dym7444jed2kq', 'name': 'vesting-interface', 'tags': {}, 'repository': 'https://github.com/...interface', 'platform': 'WEB', 'createTime': datetime.datetime(2021, 5, 4, 3, 41, 34, 717000, tzinfo=tzlocal()), 'updateTime': datetime.datetime(2021, 5, 4, 3, 41, 34, 717000, tzinfo=tzlocal()), 'environmentVariables': {}, 'defaultDomain': 'dym7444jed2kq.amplifyapp.com', 'customRules': _rules_, 'productionBranch': {'lastDeployTime': datetime.datetime(2021, 5, 26, 15, 10, 7, 694000, tzinfo=tzlocal()), 'status': 'SUCCEED', 'thumbnailUrl': 'https://aws-amplify-', 'branchName': 'main'}, - yarn install\n build:\n commands:\n - yarn run build\n artifacts:\n baseDirectory: build\n files:\n - '**/*'\n cache:\n paths:\n - node_modules/**/*\n", 'customHeaders': '', 'enableAutoBranchCreation': False}]}
---------------
I am very confused why the next iteration doesn't have a nextToken, and how I can get to the next appId.
import boto3
import json

session = boto3.session.Session(profile_name='<Profile_Name>')
amplify_client = session.client('amplify', region_name='ap-south-1')
output = amplify_client.list_apps()
print(output['apps'])
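Note that list_apps on its own returns only a single page of results. To walk every page, the paginator from the question can simply be iterated until it is exhausted. A rough sketch of the usual pattern follows (the PageSize value is arbitrary, and whether this resolves the inconsistency reported above depends on the service's paging behaviour):

import boto3

amplify = boto3.client('amplify')

# The paginator follows nextToken internally, so no manual token handling is needed.
paginator = amplify.get_paginator('list_apps')

all_apps = []
for page in paginator.paginate(PaginationConfig={'PageSize': 100}):
    all_apps.extend(page['apps'])

for app in all_apps:
    print(app['appId'], app['name'])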
I want to extract the following fields: the movie's, director's and actors' names
from a page on allocine.fr.
This will help me build my template for further scraping.
Here is my badly working code (inside the spiders directory):
from scrapy.contrib.spiders import CrawlSpider, Rule
from cinefil.items import Article
#from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor ==> deprecated
from scrapy.linkextractors import LinkExtractor
from scrapy import log
class CinefilSpider(CrawlSpider):
    name = "cinefil"
    allowed_domains = ["allocine.fr"]
    start_urls = ["http://www.allocine.fr/film/fichefilm_gen_cfilm=29007.html"]

    rules = [
        Rule(LinkExtractor(allow=('(/film/)((?!:).)*$'),), callback="parse_item", follow=False)
    ]

    def parse_item(self, response):
        ROOTPATH = '//div[@class="meta-body-item"]'
        item = Article()
        casiers = response.xpath(ROOTPATH).extract()

        for matos in casiers:
            print("\n----- ------ ------ -------- ---------")
            print(matos)

        return item
For extracting the movie's, director's and actors' names on the allocine.fr page:
Movie name
#get from <div class="titlebar-title titlebar-title-lg">
>>> movie=response.xpath('//div[@class="titlebar-title titlebar-title-lg"]/text()').extract_first()
>>> movie
u'Spider-Man'
Director name
#start from
#<span itemprop="director">
#<a>
#<span itemprop="name">
>>> director=response.xpath('//span[@itemprop="director"]/a/span[@itemprop="name"]/text()').extract_first()
>>> director
u'Sam Raimi'
Actors name
#Take the word "Avec" as landmark and get its siblings <spans>
>>> movie_stars=response.xpath('//span[contains(text(),"Avec")]/following-sibling::span/text()').extract()
>>> movie_stars
[u'Tobey Maguire', u'Willem Dafoe', u'Kirsten Dunst', u' plus ']
#remove last item 'plus'
>>> movie_stars.pop()
u' plus '
>>> movie_stars
[u'Tobey Maguire', u'Willem Dafoe', u'Kirsten Dunst']
And items.py should be declared as:
import scrapy
class Movie(scrapy.Item):
    name = scrapy.Field()
    director = scrapy.Field()
    actors = scrapy.Field()
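Putting the shell-tested selectors together, the spider's parse_item callback could fill the item roughly as sketched below. This is a sketch only: it assumes the Movie item above is what cinefil.items ends up exporting (the question's code imported Article instead), and extract_first() is used so single values come back as strings rather than lists.

# Sketch: replacement parse_item for the CinefilSpider shown in the question.
def parse_item(self, response):
    item = Movie()
    item['name'] = response.xpath(
        '//div[@class="titlebar-title titlebar-title-lg"]/text()').extract_first()
    item['director'] = response.xpath(
        '//span[@itemprop="director"]/a/span[@itemprop="name"]/text()').extract_first()
    actors = response.xpath(
        '//span[contains(text(),"Avec")]/following-sibling::span/text()').extract()
    # Drop the trailing " plus " entry and surrounding whitespace, as in the shell example.
    item['actors'] = [a.strip() for a in actors if a.strip() != 'plus']
    return item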
Trying to get my head around Scrapy, but hitting a few dead ends.
I have two tables on a page and would like to extract the data from each one, then move on to the next page.
The tables look like this (the first one is called Y1, the second Y2) and the structures are the same.
<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
<h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">
<table class="table table-striped table-hover table-curved">
<thead>
<tr>
<th class="tCol1" style="padding: 10px;">First Col Head</th>
<th class="tCol2" style="padding: 10px;">Second Col Head</th>
<th class="tCol3" style="padding: 10px;">Third Col Head</th>
</tr>
</thead>
<tbody>
<tr>
<td>Info 1</td>
<td>Monday 5 September, 2016</td>
<td>Friday 21 October, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 2</b></td>
<td class="dtstart" timestamp="1477094400"><b></b></td>
<td class="dtend" timestamp="1477785600">
<b>Sunday 30 October, 2016</b></td>
</tr>
<tr>
<td>Info 3</td>
<td>Monday 31 October, 2016</td>
<td>Tuesday 20 December, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 4</b></td>
<td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
<td class="dtend" timestamp="1483315200">
<b>Monday 2 January, 2017</b></td>
</tr>
</tbody>
</table>
As you can see, the structure is a little inconsistent, but as long as I can get each td and output it to CSV, I'll be a happy guy.
I tried using XPath, but this only confused me more.
My last attempt:
import scrapy


class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item
No errors here, but it just fires back lots of information about the crawl and no actual results.
Update:
import scrapy


class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = (
        'https://termdates.co.uk/school-holidays-16-19-abingdon/',
    )

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
This gives me: IndentationError: unexpected indent
If I run the amended script below (thanks to @Granitosaurus) to output to CSV (-o schoolDates.csv), I get an empty file:
import scrapy


class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
This is the log:
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider opened
2017-03-23 12:04:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-23 12:04:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on ...
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/robots.txt> (referer: None)
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
2017-03-23 12:04:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy-1.3.3-py2.7.egg\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-23 12:04:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 467,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 11311,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 845000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 356000)}
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider closed (finished)
Update 2: (skips rows)
This pushes results to the CSV file but skips every other row.
The shell shows:
{'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t', 'first': None}
import scrapy


class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item
Solution: thanks to @vold.
This crawls all pages in start_urls and deals with the inconsistent table layout:
# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item


class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
You need to correct your code slightly. Since you already select all the elements within the table, you don't need to point to the table again, so you can shorten your XPath to something like this: td[1]//text().
def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item
Edited my answer since @stutray provided the link to the site.
You can use CSS selectors instead of XPaths; I always find CSS selectors easy.
def parse_products(self, response):
    for product in response.css("#Y1 table")[1:]:
        item = Schooldates1Item()
        item['hol'] = product.css('td:nth-child(1)::text').extract_first()
        item['first'] = product.css('td:nth-child(2)::text').extract_first()
        item['last'] = product.css('td:nth-child(3)::text').extract_first()
        yield item
Also, do not use the tbody tag in selectors. Source:
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
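To make the point concrete with the table from the question (a quick illustrative comparison, not part of either answer above):

# Matches the rows regardless of whether a <tbody> appears in the raw HTML.
rows = response.xpath('//*[@id="Y1"]/table//tr')

# Only matches if the downloaded source really contains <tbody>; if the tag was
# only added by the browser's DOM, this returns nothing in Scrapy.
rows_with_tbody = response.xpath('//*[@id="Y1"]/table/tbody/tr')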
I got it working with these XPaths for the HTML source you've provided:
products = sel.xpath('//*[@id="Y1"]/table//tr')
for p in products[1:]:
    item = dict()
    item['hol'] = p.xpath('td[1]/text()').extract_first()
    item['first'] = p.xpath('td[1]/text()').extract_first()
    item['last'] = p.xpath('td[1]/text()').extract_first()
    yield item
Above assumes that each table row contains 1 item.
I'm currently running the following code:
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin


def hltv_match_list(max_offset):
    offset = 0
    while offset < max_offset:
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        base = "http://www.hltv.org/"
        soup = BeautifulSoup(requests.get("http://www.hltv.org/?pageid=188&offset=0").content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        href = urljoin(base, (a["href"] for a in cont))
        # print([urljoin(base, a["href"]) for a in cont])
        get_hltv_match_data(href)
        offset += 50


def get_hltv_match_data(matchid_url):
    source_code = requests.get(matchid_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for teamid in soup.findAll("div.covSmallHeadline a[href*=teamid=]"):
        print teamid.string


hltv_match_list(5)
Errors:
  File "C:/Users/mdupo/PycharmProjects/HLTVCrawler/Crawler.py", line 12, in hltv_match_list
    href = urljoin(base, (a["href"] for a in cont))
  File "C:\Python27\lib\urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Python27\lib\urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'generator' object has no attribute 'find'
Process finished with exit code 1
I think I'm having trouble with the href = urljoin(base, (a["href"] for a in cont)) part, as I'm trying to create a URL list I can feed into get_hltv_match_data to then capture various items within each page. Am I going about this wrong?
Cheers
You need to join each href as per your commented code:
urls = [urljoin(base,a["href"]) for a in cont]
You are trying to join the base URL to a generator, i.e. (a["href"] for a in cont), which makes no sense.
You should also be passing url to requests, or you are going to be requesting the same page over and over:
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
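Putting both fixes together, the request loop could look something like the sketch below (the selectors are kept exactly as in the question, so they may still need tuning):

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2, as in the original code


def hltv_match_list(max_offset):
    base = "http://www.hltv.org/"
    offset = 0
    while offset < max_offset:
        # Request the page for the current offset instead of always offset=0.
        url = 'http://www.hltv.org/?pageid=188&offset=' + str(offset)
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        cont = soup.select("div.covMainBoxContent a[href*=matchid=]")
        # Join each href individually; a generator cannot be passed to urljoin.
        urls = [urljoin(base, a["href"]) for a in cont]
        for match_url in urls:
            get_hltv_match_data(match_url)
        offset += 50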
I am using Ruby to call spreadsheet_by_key from a Google document. The first page that I call works great; however, when I try to duplicate it and use the second tab on the page, it does not work. Let me explain better with some examples.
I am using:
data = session.spreadsheet_by_key("spreadsheetkeygoeshere").worksheets[0]

# Get Graph-Data
(2..data.num_rows).each do |column|
  key = data[column, 10]
  title = data[column, 2]
  current = data[column, 3]
  goal = data[column, 4]
  send_event(key, title: title, min: 0, max: goal, value: current)
end
This works great and returns all of the expected values. Here is the problem I am having: this is on page 1, the first page that loads when you open the Google doc. Now let's say I want to make a new sheet on the same doc, just under a new tab with a different name, and display that data as well.
Here is how I changed the code:
data1 = session.spreadsheet_by_key("spreadsheetkeygoeshere").worksheets[1]

# Get Graph-Data
(2..data1.num_rows).each do |column|
  key = data[column, 10]
  puts key
  title = data[column, 1]
  current = data[column, 5]
  goal = data[column, 6]
  send_event(key, title: title, min: 0, max: goal, value: current)
end
So I changed .worksheets[0] to .worksheets[1].
I also changed (2..data.num_rows) to (2..data1.num_rows),
and I changed data = to data1 =.
Any ideas on what I am doing wrong that causes the second worksheet not to get pulled? Any help is greatly appreciated.
What worked was Cameron's suggestion. I went in and changed everything to just data = instead of data1 =, and that fixed the problem.