XPath is correct but no result after scraping

I am trying to crawl the names of all the cities listed on the following page:
https://www.zomato.com/directory.
I have tried the following XPath expressions.
# 1st approach:
def parse(self, response):
    cities_name = response.xpath('//div//h2//a/text()').extract_first()
    items['cities_name'] = cities_name
    yield items
# 2nd approach:
def parse(self, response):
    for city in response.xpath("//div[@class='col-l-5 col-s-8 item pt0 pb5 ml0']"):
        l = ItemLoader(item=CountryItem(), selector=city)
        l.add_xpath("cities_name", ".//h2//a/text()")
        yield l.load_item()
        yield city
Actual result: Crawl 0 pages and scrape 0 items
Expected: Adelaide, Ballarat etc

First thing to note:
Your XPath is too specific. The order of CSS classes in HTML is not reliable: class1 class2 could just as well appear as class2 class1, or contain stray whitespace such as a trailing space: "class1 class2 ".
When you match the attribute exactly with [@class="class1 class2"], there is a high chance that it will fail. Use the contains() function instead.
Second:
There is a small error in your cities_name XPath. In the HTML body the nesting is a > h2 > text, while your expression has it reversed as h2 > a > text.
With that in mind, I got it working with these CSS and XPath selectors:
$ parsel "https://www.zomato.com/directory"
> p.mb10>a>h2::text +first
Adelaide
> p.mb10>a>h2::text +len
736
> -xpath
switched to xpath
> //p[contains(@class,"mb10")]/a/h2/text() +first
Adelaide
> //p[contains(@class,"mb10")]/a/h2/text() +len
736
parselcli - https://github.com/Granitosaurus/parsel-cli

You have a wrong XPath:
def parse(self, response):
    for city_node in response.xpath("//h2"):
        l = ItemLoader(item=CountryItem(), selector=city_node)
        l.add_xpath("city_name", ".//a/text()")
        yield l.load_item()

The main reason you are not getting any results from that page is that its HTML is not well-formed. You can get the results with the html5lib parser; I tried several parsers and that was the one that worked. Here is how you can do it (I used a CSS selector, though).
import scrapy
from bs4 import BeautifulSoup
class ZomatoSpider(scrapy.Spider):
    name = "zomato"
    start_urls = ['https://www.zomato.com/directory']
    def parse(self, response):
        # html5lib tolerates the malformed markup on this page
        soup = BeautifulSoup(response.text, 'html5lib')
        for item in soup.select(".row h2 > a"):
            yield {"name": item.text}

Related

How to select a specific part of the element in XPath?

I am working on scraping specific data of a website using scrapy and by copying the Xpath. I have this element, however I'd like to select only one part of the element.
The element/argument is:
{"cat42":"0","taal":"nl","loggedin":"false","cat1":"noordbrabant","cat2":"eindhoven","cat3":"5655jb","cat4":"83","cat6":"949949","cat7":"woonhuis","cat8":"6","cat9":"211","cat10":"c","cat11":"62584","cat12":"villa","cat13":"bestaandebouw","cat24":"0","cat26":"vbo","cat28":"1","cat29":"1978","cat30":"900000","cat33":"koop","cat34":"verkocht","cat35":"88909230","cat36":"0","cat38":"gemeenteeindhoven","cat39":"ooievaarsnest","cat43":"0","cat44":"0","postcode":"5655jb","plaats":"eindhoven","provincie":"noordbrabant","huisnummer":"83","woonoppervlakte":"211","vraagprijs":"949949","aantalkamers":"6","soortobject":"woonhuis","energieklasse":"c","hoofdaanbieder":"62584","bouwvorm":"bestaandebouw","soortwoning":"villa","bedrijfsruimte":"false","branchevereniging":"vbo","dakterras":"false","tuin":"true","balkon":"false","soortaanbieding":"koop","tinyid":"88909230","vraagprijsrange":"900000","bouwjaar":"1978","openhuis":"false","gemeente":"eindhoven","buurt":"ooievaarsnest","monumentalestatus":"false","rijksmonument":"false","soortaanbod":"koop","energiezuinig":"false","kluswoning":"false","adgroup":"b","status":"verkocht","environment":"production"}
I'd like to select only "tuin":"true" using XPath. I have tried: response.xpath('//tuin[@id="content"]/script[1]/text()').extract() but it gives me '[]' as a result.
So how can I select only the part that I want?
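Let me know if I am wrong, but as far as I know:
XPath can only select nodes of the document (elements, attributes, text nodes), not a fragment of the text inside a node.
For your requirement, you have to extract the text of the relevant element and then check its string content.
So you cannot pick "tuin":"true" out of the script with XPath alone; you need some extra logic to find the right element and value.
Here is sample code for the extraction: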
import scrapy
class FundaSpider(scrapy.Spider):
    name = 'funda'
    allowed_domains = ['funda.nl']
    start_urls = ['https://www.funda.nl/koop/verkocht/eindhoven/huis-88909230-ulenpas-83/']
    def parse(self, response):
        tuin_json_texts = response.xpath('//script[@type="application/ld+json"]/text()').getall()
        for single_json in tuin_json_texts:
            expected_text = '"tuin":"true"'
            if expected_text in single_json:
                print(single_json.strip())
Example with scrapy shell:
scrapy shell https://www.funda.nl/koop/verkocht/eindhoven/huis-88909230-ulenpas-83/
In [1]: response.xpath('//main[@id="content"]/script//text()').get().strip()
Out[1]: '{"cat42":"0","taal":"nl","loggedin":"false","cat1":"noordbrabant","cat2":"eindhoven","cat3":"5655jb","cat4":"83","cat6":"949949","cat7":"woonhuis","cat8":"6","cat9":"211","cat10":"c","cat11":"62584","cat12":"villa","cat13":"bestaandebouw","cat24":"0","cat26":"vbo","cat28":"1","cat29":"1978","cat30":"900000","cat33":"koop","cat34":"verkocht","cat35":"88909230","cat36":"0","cat38":"gemeenteeindhoven","cat39":"ooievaarsnest","cat43":"0","cat44":"0","postcode":"5655jb","plaats":"eindhoven","provincie":"noordbrabant","huisnummer":"83","woonoppervlakte":"211","vraagprijs":"949949","aantalkamers":"6","soortobject":"woonhuis","energieklasse":"c","hoofdaanbieder":"62584","bouwvorm":"bestaandebouw","soortwoning":"villa","bedrijfsruimte":"false","branchevereniging":"vbo","dakterras":"false","tuin":"true","balkon":"false","soortaanbieding":"koop","tinyid":"88909230","vraagprijsrange":"900000","bouwjaar":"1978","openhuis":"false","gemeente":"eindhoven","buurt":"ooievaarsnest","monumentalestatus":"false","rijksmonument":"false","soortaanbod":"koop","energiezuinig":"false","kluswoning":"false","adgroup":"b","status":"verkocht","environment":"production"}'
Now you can get values with their keys by using json:
In [2]: data = response.xpath('//main[@id="content"]/script//text()').get().strip()
In [3]: import json
In [4]: json_data = json.loads(data)
In [5]: json_data['tuin']
Out[5]: 'true'
In [6]: json_data['environment']
Out[6]: 'production'
In [7]: json_data['woonoppervlakte']
Out[7]: '211'
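The same lookup can be wrapped in a small helper so that a spider yields only the fields it needs. This is a sketch using only the standard library; the function name extract_fields and the shortened sample string are made up for illustration, and in a spider the text would come from the XPath shown above.

```python
import json

def extract_fields(script_text, keys):
    """Parse the JSON text of a <script> element and return only the requested keys."""
    try:
        data = json.loads(script_text.strip())
    except json.JSONDecodeError:
        # Not every <script> holds JSON; signal that with an empty result.
        return {}
    if not isinstance(data, dict):
        return {}
    return {key: data[key] for key in keys if key in data}

# Shortened stand-in for the script text returned by the selector.
sample = ' {"tuin":"true","balkon":"false","bouwjaar":"1978"} '
fields = extract_fields(sample, ["tuin", "bouwjaar", "missing"])
print(fields)  # {'tuin': 'true', 'bouwjaar': '1978'}
```

Note that the values stay strings ('true', '1978') because that is how the page encodes them; convert them explicitly if you need booleans or integers.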

Extracting content of <script> with Scrapy

I'm trying to extract the latitude and longitude from this page: https://www.realestate.com.kh/buy/nirouth/4-bed-5-bath-twin-villa-143957/
Where it can be found in this part of the page (the Xpath of this part is /html/head/script[8]):
<script type="application/ld+json">{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"},"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95,"address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"}}}</script>
Here's my script:
import scrapy
class ScrapingSpider(scrapy.Spider):
    name = 'scraping'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']
    def parse(self, response):
        lat = response.xpath('/html/head/script[8]')
        print('----------------', lat)
        yield {
            'lat': lat
        }
However, this XPath yields an empty list. Is it because the content I'm looking for is in a JS script?
Since Scrapy doesn't execute JavaScript, some <script> tags may not be present in the page that Scrapy receives. For this reason, using an index to pinpoint the element you want isn't a good idea. It is better to search for something specific; my suggestion would be:
response.xpath('//head/script[contains(text(), "latitude")]')
Edit:
The selector above returns a SelectorList, which you can then parse however you like. If you want to extract the whole text of the script, you can use:
response.xpath('//head/script[contains(text(), "latitude")]/text()').get()
If you want only the latitude value, you can use a regex:
response.xpath('//head/script[contains(text(), "latitude")]/text()').re_first(r'"latitude":(\d{1,3}\.\d{1,2})')
Docs on using regex methods of Selectors.
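An alternative to the regex, sketched here under the assumption that the script body is valid JSON-LD as in the question, is to parse it with the standard json module and read the geo object. The pattern above would not match negative coordinates and would cut off values with more than two decimals, which parsing avoids. The variable script_text stands in for what .get() returns and is shortened from the question's snippet.

```python
import json

# Stand-in for the text returned by the XPath above (shortened from the question).
script_text = (
    '{"@context":"http://schema.org","@type":"Residence",'
    '"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95}}'
)

data = json.loads(script_text)
geo = data.get("geo", {})

# JSON numbers are decoded to float, so no further conversion is needed.
latitude = geo.get("latitude")
longitude = geo.get("longitude")
print(latitude, longitude)  # 11.52 104.95
```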

Get value of text (with no tag) in scrapy

I am trying to get the text that is not wrapped in a tag such as <p> or <a> from this link:
https://timesofindia.indiatimes.com/us/donald-trump-boris-johnson-talk-5g-and-trade-ahead-of-g7-white-house/articleshow/70504270.cms
So far I have used the Scrapy shell to get the values with this code:
item=response.xpath("//div[@class='Normal']/text()").extract()
Or
item=response.css('arttextxml *::text').extract()
The problem is that I get values when I run these commands in the Scrapy shell, but when I use them in my Scrapy spider file they return a null value.
Is there any solution for this problem?
There are multiple problems with your code.
First, it is messy. Second, the CSS selector you use to collect the links to the news articles returns the same URL more than once. Third, in your scrapy.Request call you pass self.parseNews as the callback, but no such method exists anywhere in the file.
I have fixed your code to some extent, and I no longer run into any issue with it.
# -*- coding: utf-8 -*-
import scrapy
class TimesofindiaSpider(scrapy.Spider):
    name = 'timesofindia'
    allowed_domains = ["timesofindia.indiatimes.com"]
    start_urls = ["https://timesofindia.indiatimes.com/World"]
    base_url = "https://timesofindia.indiatimes.com/"
    def parse(self, response):
        for urls in response.css('div.top-newslist > ul > li'):
            url = urls.css('a::attr(href)').extract_first()
            yield scrapy.Request(self.base_url + url, callback=self.parse_save)
    def parse_save(self, response):
        print(response.xpath("//div[@class='Normal']/text()").extract())
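One detail in the spider above: base_url ends with a slash, so if the extracted href also starts with one, plain concatenation produces a double slash. A sketch of the difference using the standard library's urljoin follows; inside a Scrapy callback, response.urljoin(url) does the same job relative to the response URL. The href values are invented for illustration.

```python
from urllib.parse import urljoin

base_url = "https://timesofindia.indiatimes.com/"

# Hypothetical href values as they might come out of a::attr(href).
hrefs = [
    "/world/us/some-article.cms",   # root-relative
    "world/us/some-article.cms",    # relative
    "https://timesofindia.indiatimes.com/world/us/some-article.cms",  # already absolute
]

concatenated = [base_url + href for href in hrefs]
joined = [urljoin(base_url, href) for href in hrefs]

for c, j in zip(concatenated, joined):
    print(c, "->", j)
```

All three joined results are the same absolute URL, whereas concatenation only gets the plain relative case right.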
I wrote a simple spider for you; it produces your desired output.
Also, please show your code so I can point out what you are doing wrong.
Scraper
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['timesofindia.indiatimes.com']
    start_urls = ['https://timesofindia.indiatimes.com/us/donald-trump-boris-johnson-talk-5g-and-trade-ahead-of-g7-white-house/articleshow/70504270.cms']
    def parse(self, response):
        item = response.xpath('//div[@class="Normal"]/text()').extract()
        yield {'Item': item}

Scrapy Xpath with text() contains

I'm using scrapy, and I'm trying to look for a span that contains a specific text. I have:
response.selector.xpath('//*[@class="ParamText"]/span/node()')
which returns:
[<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>,
 <Selector xpath='//*[@class="ParamText"]/span/text()' data=u'C'>,
 <Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>]
However when I run:
>>> response.selector.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')
Out[11]: []
Why does the contains function not work?
contains() does not evaluate multiple nodes at once. In XPath 1.0, when a node-set is passed where a string is expected, only the first node is used:
/span[contains(text(),"STODOLINK")]
So, if the span has several text nodes and "STODOLINK" isn't in the first text-node child of the span, contains() in the expression above won't match. Apply the contains() check to the individual text nodes instead, as follows:
//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]
Or, if "STODOLINK" isn't necessarily located directly within the span (it can be nested in another element inside the span), you can simply use . instead of text():
//*[@class="ParamText"]/span[contains(.,"STODOLINK")]
In my terminal (assuming my example is identical to your file), your code works:
Input
import scrapy
example = '<div class="ParamText"><span>STODOLINK</span></div>'
scrapy.Selector(text=example).xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract()
Output:
['<span>STODOLINK</span>']
Can you clarify what might be different?
I use Scrapy with BeautifulSoup 4. IMO, Soup is easy to read and understand. This is an option if you don't have to use HtmlXPathSelector. Below is an example that finds all links; you can replace 'a' with 'span'. Hope this helps!
import scrapy
from bs4 import BeautifulSoup
def parse(self, response):
    soup = BeautifulSoup(response.body, 'html.parser')
    print('Current url: %s' % response.url)
    # a plain dict stands in for the answer's own Item class
    item = {}
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            url = response.urljoin(link.get('href'))
            item['url'] = url
            yield scrapy.Request(url, callback=self.parse)
            yield item

Scrapy works in shell but not when I call my spider

I have been working on this for the past few hours, but cannot figure out what I'm doing wrong. When I run my XPath statements using the selector in the Scrapy shell, they work as expected. When I use the same statements in my spider, however, I get back an empty set. Does anyone know what I am doing wrong?
from scrapy.spider import Spider
from scrapy.selector import Selector
from TFFRS.items import Result
class AthleteSpider(Spider):
    name = "athspider"
    allowed_domains = ["www.tffrs.org"]
    start_urls = ["http://www.tffrs.org/athletes/3237431/",]
    def parse(self, response):
        sel = Selector(response)
        results = sel.xpath("//table[@id='results_data']/tr")
        items = []
        for r in results:
            item = Result()
            item['event'] = r.xpath("td[@class='event']").extract()
            items.append(item)
        return items
When viewed by the spider, your URL contains no content. To debug this kind of problem, use scrapy.shell.inspect_response in the parse method, like so:
from scrapy.shell import inspect_response
class AthleteSpider(Spider):
    # all your code
    def parse(self, response):
        inspect_response(response, self)
Then, when you run
scrapy crawl <your spider>
you will get a shell from within your spider. There you should run:
In [1]: view(response)
This will display the response exactly as this particular spider received it.
Try using HtmlXPathSelector for extracting xpaths.
Remove http from the start_urls section. Also the table id is something you are not entering correctly in your xpath. Try using inspect element to get a proper xpath for the data you want to scrape.
Also consider changing the function name. From the docs:
Warning
When writing crawl spider rules, avoid using parse as callback, since
the CrawlSpider uses the parse method itself to implement its logic.
So if you override the parse method, the crawl spider will no longer
work
Scrapy looks for methods with specific names on a spider, such as parse and start_requests (there are others in the docs).
If these names are misspelled, Scrapy will not call them and you will have problems. In my case the problem was a typo: my method was named start_request instead of start_requests!
So make sure your skeleton looks something like this:
import scrapy
class MySpider(scrapy.Spider):
    name = "name"
    allowed_domains = ["example.com"]  # domains only, not full URLs
    start_urls = ['https://example.com/']
    def start_requests(self):
        # note the plural: start_requests, not start_request
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)
    def parse(self, response):
        # parse method
        pass
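A related detail in such a skeleton: allowed_domains is meant to hold bare domain names rather than full URLs. As a small sketch (the variable names and the second URL are made up), the host names can be derived from start_urls with the standard library:

```python
from urllib.parse import urlparse

start_urls = ["https://example.com/", "https://www.example.org/some/page"]

# Keep only the host part of each URL, dropping the scheme and path.
allowed_domains = sorted({urlparse(url).netloc for url in start_urls})
print(allowed_domains)  # ['example.com', 'www.example.org']
```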

Resources