How to use LinkExtractor to get all urls in a website?

I wonder if there is a way to get all URLs in an entire website. It seems that Scrapy with CrawlSpider and LinkExtractor is a good choice. Consider this example:
from scrapy.item import Field, Item
# The scrapy.contrib.* import paths were removed in later Scrapy releases.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SampleItem(Item):
    link = Field()


class SampleSpider(CrawlSpider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    rules = (
        Rule(LinkExtractor(), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = SampleItem()
        item['link'] = response.url
        return item
This spider does not give me what I want. It only gives me all the links on a single webpage, namely, the start_url. But what I want is every link in this website, including those that are not on the start url. Did I understand the example correctly? Is there a solution to my problem? Thanks a lot!

Export each item via a Feed Export. This will result in a list of all links found on the site.
Or, write your own Item Pipeline to export all of your links to a file, database, or whatever you choose.
Another option would be to create a spider level list to which you append each URL, instead of using items at all. How you proceed will really depend on what you need from the spider, and how you intend to use it.

You could create a spider that gathers all the links on a page and then checks the domain of each link: if it is the same domain, parse that page too, then rinse and repeat.
There is no guarantee, however, that you will catch every page of that domain. In my opinion, "How to get all webpages on a domain" gives a good overview of the issue.
import scrapy
from urllib.parse import urlsplit


class SampleSpider(scrapy.Spider):
    name = "sample_spider"
    allowed_domains = ["domain.com"]
    start_urls = ["http://domain.com"]

    def parse(self, response):
        # response.xpath() replaces the deprecated HtmlXPathSelector.
        urls = response.xpath('//a/@href').getall()
        for u in urls:
            # Resolve relative hrefs so that the host comparison works.
            u = response.urljoin(u)
            # Follow only links that stay on the same host as this page.
            if urlsplit(u).netloc == urlsplit(response.url).netloc:
                yield scrapy.Request(u, self.parse)
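The same-host check can also be tried on its own, without running a crawl. Below is a minimal sketch using only the standard library; the helper name same_site_links and the sample hrefs are assumptions made for illustration and are not part of Scrapy.

```python
from urllib.parse import urljoin, urlsplit


def same_site_links(page_url, hrefs):
    """Return absolute http(s) links from hrefs that share page_url's host."""
    host = urlsplit(page_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative hrefs
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc != host:
            continue  # skip links that leave the site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical hrefs as they might appear on a page of domain.com.
links = same_site_links(
    "http://domain.com/index.html",
    ["/about", "contact.html", "http://other.com/x", "mailto:me@domain.com", "/about"],
)
print(links)
```

Note that comparing netloc treats www.domain.com and domain.com as different hosts, so a real crawl may need a looser comparison.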

Related

Extracting content of <script> with Scrapy

I'm trying to extract the latitude and longitude from this page: https://www.realestate.com.kh/buy/nirouth/4-bed-5-bath-twin-villa-143957/
Where it can be found in this part of the page (the Xpath of this part is /html/head/script[8]):
<script type="application/ld+json">{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"},"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95,"address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"}}}</script>
Here's my script :
import scrapy


class ScrapingSpider(scrapy.Spider):
    name = 'scraping'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        lat = response.xpath('/html/head/script[8]')
        print('----------------', lat)
        yield {
            'lat': lat
        }
However, this XPath yields an empty list. Is it because the content I'm looking for is in a JS script?

Since Scrapy doesn't execute JavaScript, some <script> tags that you see in the browser may not be present in the page Scrapy downloads. For this reason, using an index to pinpoint the element you want isn't a good idea. It is better to search for something specific. My suggestion would be:
response.xpath('//head/script[contains(text(), "latitude")]')
Edit:
The above selector will return a selector list, from it you can choose how to parse. If you want to extract the whole text in script you can use:
response.xpath('//head/script[contains(text(), "latitude")]/text()').get()
If you want only the latitude value, you can use a regex:
response.xpath('//head/script[contains(text(), "latitude")]/text()').re_first(r'"latitude":(\d{1,3}\.\d{1,2})')
Docs on using regex methods of Selectors.
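Because the script body is JSON-LD, the string returned by .get() can also be parsed as JSON rather than matched with a regex. The sketch below uses only the standard library and an abridged copy of the script text from the question; in a spider, script_text would come from the .get() call above, which is an assumption about how the pieces fit together.

```python
import json
import re

# Abridged script body from the question (JSON-LD keys start with "@").
script_text = (
    '{"@context":"http://schema.org","@type":"Residence",'
    '"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95}}'
)

# Option 1: parse the JSON and read the fields by name.
data = json.loads(script_text)
lat = data["geo"]["latitude"]
lon = data["geo"]["longitude"]

# Option 2: the regex from the answer, applied with the re module.
match = re.search(r'"latitude":(\d{1,3}\.\d{1,2})', script_text)
lat_from_regex = float(match.group(1))

print(lat, lon, lat_from_regex)
```

Parsing the JSON does not depend on how many decimal places the value has, whereas the regex only captures one or two digits after the decimal point.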

Get value of text (with no tag) in scrapy

I am trying to get the value of text (with no tag like <p>,<a> etc.) from this link
https://timesofindia.indiatimes.com/us/donald-trump-boris-johnson-talk-5g-and-trade-ahead-of-g7-white-house/articleshow/70504270.cms
So far I have used scrapy shell to get their values using this code
item=response.xpath("//div[@class='Normal']/text()").extract()
Or
item=response.css('arttextxml *::text').extract()
The problem is that I get values when I run these commands in the Scrapy shell, but when I use them in my Scrapy spider file they return nothing.
Is there a solution to this problem?

There are multiple problems with your code.
First, it is messy. Second, the CSS selector you use to get the links to the news articles returns the same URL more than once. Third, your scrapy.Request call uses self.parseNews as the callback, but no such method exists anywhere in the file.
I have fixed your code to some extent, and right now I am not seeing any issue with it.
# -*- coding: utf-8 -*-
import scrapy


class TimesofindiaSpider(scrapy.Spider):
    name = 'timesofindia'
    allowed_domains = ["timesofindia.indiatimes.com"]
    start_urls = ["https://timesofindia.indiatimes.com/World"]

    def parse(self, response):
        for urls in response.css('div.top-newslist > ul > li'):
            url = urls.css('a::attr(href)').extract_first()
            if url:
                # urljoin handles relative and absolute hrefs alike and
                # avoids the double slash that base_url + url could produce.
                yield scrapy.Request(response.urljoin(url), callback=self.parse_save)

    def parse_save(self, response):
        print(response.xpath("//div[@class='Normal']/text()").extract())

I wrote a simple spider for you. It gives the output you want.
Please also show your code so I can point out what you are doing wrong.
Scraper
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['timesofindia.indiatimes.com']
    start_urls = ['https://timesofindia.indiatimes.com/us/donald-trump-boris-johnson-talk-5g-and-trade-ahead-of-g7-white-house/articleshow/70504270.cms']

    def parse(self, response):
        item = response.xpath('//div[@class="Normal"]/text()').extract()
        yield {'Item': item}

How to crawl multiple links in Scrapy following an xpath-based rule in the given start page?

I have created a spider that successfully extracts the data I want from a single page. Now I need it to crawl multiple similar pages and do the same.
The start page is going to be this one, which lists many unique items from the game (Araku tiki, sidhbreath etc). I want the spider to crawl all of those items.
Given that start page, how do I identify which links to follow?
Here are the XPaths for the first 3 links I want it to follow:
//*[@id="mw-content-text"]/div[3]/table/tbody/tr[1]/td[1]/span/span[1]/a[1]
//*[@id="mw-content-text"]/div[3]/table/tbody/tr[2]/td[1]/span/span[1]/a[1]
//*[@id="mw-content-text"]/div[3]/table/tbody/tr[3]/td[1]/span/span[1]/a[1]
As you can see, there is an increasing number in the middle: 1, then 2, then 3 and so on. How do I crawl those pages?
Here is a snippet of my code that works for the first item, Araku Tiki, with its page set as the start page:
import scrapy
from PoExtractor.items import PoextractorItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RedditSpider(scrapy.Spider):
    name = "arakaali"
    # allowed_domains = ['pathofexile.gamepedia.com']
    start_urls = ['https://pathofexile.gamepedia.com/Araku_Tiki']

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=()), callback="parse",
             follow=True),
    )

    def parse(self, response):
        item = PoextractorItem()
        item["item_name"] = response.xpath("//*[@id='mw-content-text']/span/span[1]/span[1]/text()[1]").extract()
        item["flavor_text"] = response.xpath("//*[@id='mw-content-text']/span/span[1]/span[2]/span[3]/text()").extract()
        yield item
Please note: I have not been able to make it follow all the links on the start page either; my code only works if the start page is the one containing the requested data.
Thanks in advance for every reply.

You can send requests in several ways.
1. Since you are using Scrapy, you can use the following code:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
parse_page1 sends a request to the URL, and the response arrives in parse_page2.
2. You can also send requests with the Python requests module:
import requests
resp = requests.get("http://www.something.com")
print(resp.text)
Please comment if you have any doubts about this. Thank you.
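As for the three XPaths in the question, the only part that changes is the row index, so dropping it (tr instead of tr[1]) matches every row at once. The sketch below shows this with the standard library on hypothetical, simplified markup; the sample HTML, the item_links helper, and the example.com base URL are all assumptions, since the real page is not shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup shaped like the XPaths in the question.
SAMPLE = """
<div id="mw-content-text">
  <table><tbody>
    <tr><td><span><span><a href="/Araku_Tiki">Araku Tiki</a></span></span></td></tr>
    <tr><td><span><span><a href="/Second_Item">Second Item</a></span></span></td></tr>
    <tr><td><span><span><a href="/Third_Item">Third Item</a></span></span></td></tr>
  </tbody></table>
</div>
"""


def item_links(markup, base_url):
    """Collect the first link of the first cell in every table row."""
    root = ET.fromstring(markup)
    # "tr" without an index matches every row, not just tr[1], tr[2], ...
    anchors = root.findall(".//table/tbody/tr/td[1]/span/span[1]/a[1]")
    return [urljoin(base_url, a.get("href")) for a in anchors]


links = item_links(SAMPLE, "https://example.com/list")
print(links)
```

In Scrapy, an index-free expression like this could be passed to response.xpath() or to the restrict_xpaths argument of LinkExtractor. Keep in mind that browsers often insert tbody themselves, so it may be missing from the HTML that Scrapy downloads.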

Scrapy Xpath with text() contains

I'm using scrapy, and I'm trying to look for a span that contains a specific text. I have:
response.selector.xpath('//*[@class="ParamText"]/span/node()')
which returns:
[<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u'C'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>]
However when I run:
>>> response.selector.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')
Out[11]: []
Why does the contains function not work?

contains() cannot evaluate multiple nodes at once. When it is given several nodes, it only looks at the first one:
/span[contains(text(),"STODOLINK")]
So if there are multiple text nodes within the span, and "STODOLINK" isn't located in the first text node child of the span, contains() in the above expression won't match. You should apply the contains() check to the individual text nodes, as follows:
//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]
Or, if "STODOLINK" isn't necessarily located directly within the span (it can be nested within another element in the span), you can simply use . instead of text():
//*[@class="ParamText"]/span[contains(.,"STODOLINK")]
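The difference between the three predicates can be imitated with the standard library, which makes it easy to see which text each one tests. This is only an emulation under stated assumptions: the span markup is hypothetical (the real page is not shown), child_text_nodes is a helper invented for this sketch, and ElementTree is used here because its own XPath subset does not support contains().

```python
import xml.etree.ElementTree as ET


def child_text_nodes(element):
    """Text nodes that are direct children of element, in document order."""
    nodes = []
    if element.text:
        nodes.append(element.text)
    for child in element:
        if child.tail:
            nodes.append(child.tail)
    return nodes


# Hypothetical span with two text nodes separated by an element.
span = ET.fromstring("<span>C<br/> MILES STODOLINK</span>")
texts = child_text_nodes(span)

# contains(text(), "STODOLINK"): only the first text node is tested.
first_only = "STODOLINK" in texts[0]
# text()[contains(., "STODOLINK")]: every text node is tested on its own.
any_node = any("STODOLINK" in t for t in texts)
# contains(., "STODOLINK"): the element's full string value is tested.
string_value = "STODOLINK" in "".join(span.itertext())

print(texts, first_only, any_node, string_value)
```

With this markup the first check fails because the first text node is just "C", while the other two succeed.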

In my terminal (assuming my example is identical to your file, though) your code works:
Input:
import scrapy
example='<div class="ParamText"><span>STODOLINK</span></div>'
scrapy.Selector(text=example).xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract()
Output:
['<span>STODOLINK</span>']
Can you clarify what might be different?

I use Scrapy with BeautifulSoup 4. IMO, Soup is easy to read and understand. This is an option if you don't have to use HtmlXPathSelector. Below is an example that finds all links; you can replace 'a' with 'span'. Hope this helps!
import scrapy
from bs4 import BeautifulSoup


class LinkItem(scrapy.Item):
    # Minimal item assumed here; the original imported an unspecified Item.
    url = scrapy.Field()


# This method belongs inside a scrapy.Spider subclass.
def parse(self, response):
    soup = BeautifulSoup(response.body, 'html.parser')
    print('Current url: %s' % response.url)
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            url = response.urljoin(link.get('href'))
            item = LinkItem()
            item['url'] = url
            yield item
            yield scrapy.Request(url, callback=self.parse)

Scrapy works in shell but not when I call my spider

I have been working on this for the past few hours, but cannot figure out what I'm doing wrong. When I run my XPath statements using the selector in the Scrapy shell, they work as expected. When I try to use the same statements in my spider, however, I get back an empty set. Does anyone know what I am doing wrong?
# Newer Scrapy releases expose Spider from scrapy.spiders (plural).
from scrapy.spiders import Spider
from scrapy.selector import Selector
from TFFRS.items import Result


class AthleteSpider(Spider):
    name = "athspider"
    allowed_domains = ["www.tffrs.org"]
    start_urls = ["http://www.tffrs.org/athletes/3237431/",]

    def parse(self, response):
        sel = Selector(response)
        results = sel.xpath("//table[@id='results_data']/tr")
        items = []
        for r in results:
            item = Result()
            item['event'] = r.xpath("td[@class='event']").extract()
            items.append(item)
        return items

The response your spider receives for that URL contains no content. To debug this kind of problem, use scrapy.shell.inspect_response in the parse method, like so:
from scrapy.shell import inspect_response


class AthleteSpider(Spider):
    # all your code

    def parse(self, response):
        inspect_response(response, self)
Then, when you run
scrapy crawl <your spider>
you will get a shell from within your spider. There you should do:
In [1]: view(response)
This opens the response in your browser exactly as this particular spider received it.

Try using HtmlXPathSelector for extracting xpaths.
Remove http from the start_urls section. Also the table id is something you are not entering correctly in your xpath. Try using inspect element to get a proper xpath for the data you want to scrape.
Also consider changing the function name. From the docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

Scrapy spiders rely on methods with specific names, for example parse and start_requests; there are others in the docs.
If you don't implement these methods under exactly those names, you will have problems. In my case the problem was a typo: my function was named start_request instead of start_requests!
So make sure your skeleton is something like this:
import scrapy


class MySpider(scrapy.Spider):
    name = "name"
    allowed_domains = ["example.com"]  # domain names only, without a scheme
    start_urls = ['https://example.com/']

    def start_requests(self):
        # Note the plural: a method named start_request is never called.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Parse the response here.
        pass
