Scrapy Xpath with text() contains - xpath

I'm using scrapy, and I'm trying to look for a span that contains a specific text. I have:
response.selector.xpath('//*[@class="ParamText"]/span/node()')
which returns:
[<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u'C'>,
<Selector xpath='//*[@class="ParamText"]/span/text()' data=u' MILES STODOLINK'>]
However when I run:
>>> response.selector.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')
Out[11]: []
Why does the contains function not work?

contains() cannot evaluate multiple nodes at once:
/span[contains(text(),"STODOLINK")]
So if there are multiple text nodes within the span, and "STODOLINK" isn't located in the first text node child of the span, then contains() in the above expression won't work. You should instead apply the contains() check to each individual text node, as follows:
//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]
Or, if "STODOLINK" isn't necessarily located directly within the span (it can be nested within another element inside the span), you can simply use . instead of text():
//*[@class="ParamText"]/span[contains(.,"STODOLINK")]
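The difference is easy to reproduce with lxml (Scrapy's selectors use the same libxml2 XPath 1.0 engine underneath). The fragment below is invented for illustration: the <br/> splits the span's text into two text nodes, which is exactly the situation where contains(text(), ...) fails:

```python
from lxml import etree

# Invented fragment: the <br/> splits the span's text into two text nodes,
# so text() returns the node-set [" MILES ", " STODOLINK"].
html = '<div class="ParamText"><span> MILES <br/> STODOLINK</span></div>'
tree = etree.fromstring(html, parser=etree.HTMLParser())

# contains(node-set, ...) only looks at the FIRST node of the set in XPath 1.0,
# and " MILES " does not contain "STODOLINK":
print(len(tree.xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]')))  # 0

# Testing each text() node individually matches:
print(len(tree.xpath('//*[@class="ParamText"]/span[text()[contains(.,"STODOLINK")]]')))  # 1

# "." is the span's full string value, so this also matches:
print(len(tree.xpath('//*[@class="ParamText"]/span[contains(.,"STODOLINK")]')))  # 1
```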

In my terminal (assuming my example is identical to your file though) your code works:
Input
import scrapy
example = '<div class="ParamText"><span>STODOLINK</span></div>'
scrapy.Selector(text=example).xpath('//*[@class="ParamText"]/span[contains(text(),"STODOLINK")]').extract()
Output:
['<span>STODOLINK</span>']
Can you clarify what might be different?

I use Scrapy with BeautifulSoup 4. IMO, Soup is easy to read and understand. This is an option if you don't have to use HtmlXPathSelector. Below is an example for finding all links. You can replace 'a' with 'span'. Hope this helps!
import scrapy
from bs4 import BeautifulSoup

def parse(self, response):
    soup = BeautifulSoup(response.body, 'html.parser')
    print('Current url: %s' % response.url)
    item = Item()  # Item is your scrapy.Item subclass
    for link in soup.find_all('a'):
        if link.get('href') is not None:
            url = response.urljoin(link.get('href'))
            item['url'] = url
            yield scrapy.Request(url, callback=self.parse)
            yield item

Related

How to select a specific part of the element in XPath?

I am working on scraping specific data from a website using Scrapy, by copying the XPath. I have this element; however, I'd like to select only one part of it.
The element/argument is:
{"cat42":"0","taal":"nl","loggedin":"false","cat1":"noordbrabant","cat2":"eindhoven","cat3":"5655jb","cat4":"83","cat6":"949949","cat7":"woonhuis","cat8":"6","cat9":"211","cat10":"c","cat11":"62584","cat12":"villa","cat13":"bestaandebouw","cat24":"0","cat26":"vbo","cat28":"1","cat29":"1978","cat30":"900000","cat33":"koop","cat34":"verkocht","cat35":"88909230","cat36":"0","cat38":"gemeenteeindhoven","cat39":"ooievaarsnest","cat43":"0","cat44":"0","postcode":"5655jb","plaats":"eindhoven","provincie":"noordbrabant","huisnummer":"83","woonoppervlakte":"211","vraagprijs":"949949","aantalkamers":"6","soortobject":"woonhuis","energieklasse":"c","hoofdaanbieder":"62584","bouwvorm":"bestaandebouw","soortwoning":"villa","bedrijfsruimte":"false","branchevereniging":"vbo","dakterras":"false","tuin":"true","balkon":"false","soortaanbieding":"koop","tinyid":"88909230","vraagprijsrange":"900000","bouwjaar":"1978","openhuis":"false","gemeente":"eindhoven","buurt":"ooievaarsnest","monumentalestatus":"false","rijksmonument":"false","soortaanbod":"koop","energiezuinig":"false","kluswoning":"false","adgroup":"b","status":"verkocht","environment":"production"}
I'd like to select only "tuin":"true" using xpath. I have tried: response.xpath('//tuin[@id="content"]/script[1]/text()').extract() but it gives me '[]' as a result.
So how can I select only the part that I want?
Let me know if I am wrong, but with XPath you can only select HTML nodes.
For your requirement you have to check the string content of a specific element:
you can't select part of a script's text with XPath directly; you have to add some logic to retrieve the exact piece you want.
Here is sample code for the extraction:
import scrapy

class FundaSpider(scrapy.Spider):
    name = 'funda'
    allowed_domains = ['funda.nl']
    start_urls = ['https://www.funda.nl/koop/verkocht/eindhoven/huis-88909230-ulenpas-83/']

    def parse(self, response):
        tuin_json_texts = response.xpath('//script[@type="application/ld+json"]/text()').getall()
        for single_json in tuin_json_texts:
            expected_text = '"tuin":"true"'
            if expected_text in single_json:
                print(single_json.strip())
Example with scrapy shell:
scrapy shell https://www.funda.nl/koop/verkocht/eindhoven/huis-88909230-ulenpas-83/
In [1]: response.xpath('//main[@id="content"]/script//text()').get().strip()
Out[1]: '{"cat42":"0","taal":"nl","loggedin":"false","cat1":"noordbrabant","cat2":"eindhoven","cat3":"5655jb","cat4":"83","cat6":"949949","cat7":"woonhuis","cat8":"6","cat9":"211","cat10":"c","cat11":"62584","cat12":"villa","cat13":"bestaandebouw","cat24":"0","cat26":"vbo","cat28":"1","cat29":"1978","cat30":"900000","cat33":"koop","cat34":"verkocht","cat35":"88909230","cat36":"0","cat38":"gemeenteeindhoven","cat39":"ooievaarsnest","cat43":"0","cat44":"0","postcode":"5655jb","plaats":"eindhoven","provincie":"noordbrabant","huisnummer":"83","woonoppervlakte":"211","vraagprijs":"949949","aantalkamers":"6","soortobject":"woonhuis","energieklasse":"c","hoofdaanbieder":"62584","bouwvorm":"bestaandebouw","soortwoning":"villa","bedrijfsruimte":"false","branchevereniging":"vbo","dakterras":"false","tuin":"true","balkon":"false","soortaanbieding":"koop","tinyid":"88909230","vraagprijsrange":"900000","bouwjaar":"1978","openhuis":"false","gemeente":"eindhoven","buurt":"ooievaarsnest","monumentalestatus":"false","rijksmonument":"false","soortaanbod":"koop","energiezuinig":"false","kluswoning":"false","adgroup":"b","status":"verkocht","environment":"production"}'
Now you can get values with their keys by using json:
In [2]: data = response.xpath('//main[@id="content"]/script//text()').get().strip()
In [3]: import json
In [4]: json_data = json.loads(data)
In [5]: json_data['tuin']
Out[5]: 'true'
In [6]: json_data['environment']
Out[6]: 'production'
In [7]: json_data['woonoppervlakte']
Out[7]: '211'

Extracting content of <script> with Scrapy

I'm trying to extract the latitude and longitude from this page: https://www.realestate.com.kh/buy/nirouth/4-bed-5-bath-twin-villa-143957/
Where it can be found in this part of the page (the Xpath of this part is /html/head/script[8]):
<script type="application/ld+json">{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"},"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95,"address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"}}}</script>
Here's my script :
import scrapy

class ScrapingSpider(scrapy.Spider):
    name = 'scraping'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        lat = response.xpath('/html/head/script[8]')
        print('----------------', lat)
        yield {
            'lat': lat
        }
However, this XPath yields an empty list. Is it because the content I'm looking for is in a JS script?
Since Scrapy doesn't execute JS, some <script> tags may not be loaded into the page. For this reason, using an index to pinpoint the element you want isn't a good idea. Better to search for something specific; my suggestion would be:
response.xpath('//head/script[contains(text(), "latitude")]')
Edit:
The above selector will return a selector list, from it you can choose how to parse. If you want to extract the whole text in script you can use:
response.xpath('//head/script[contains(text(), "latitude")]/text()').get()
If you want only the latitude value, you can use a regex:
response.xpath('//head/script[contains(text(), "latitude")]/text()').re_first(r'"latitude":(\d{1,3}\.\d{1,2})')
Docs on using regex methods of Selectors.
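Alternatively, since the script body here is valid JSON, parsing the whole text with the standard json module is more robust than a regex. A minimal sketch, where script_text is a trimmed-down stand-in for what the text() query above would return:

```python
import json

# Stand-in for: response.xpath('//head/script[contains(text(), "latitude")]/text()').get()
script_text = ('{"@context":"http://schema.org","@type":"Residence",'
               '"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95}}')

# Parse once, then navigate the structure instead of pattern-matching it:
data = json.loads(script_text)
print(data["geo"]["latitude"], data["geo"]["longitude"])  # 11.52 104.95
```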

Xpath is correct but no result after scraping

I am trying to crawl all the name of the cities of the following web:
https://www.zomato.com/directory.
I have tried to used the following xpath.
#1st approach:
def parse(self, response):
    cities_name = response.xpath('//div//h2//a/text()').extract_first()
    items['cities_name'] = cities_name
    yield items

#2nd approach:
def parse(self, response):
    for city in response.xpath("//div[@class='col-l-5 col-s-8 item pt0 pb5 ml0']"):
        l = ItemLoader(item=CountryItem(), selector=city)
        l.add_xpath("cities_name", ".//h2//a/text()")
        yield l.load_item()
        yield city
Actual result: Crawl 0 pages and scrape 0 items
Expected: Adelaide, Ballarat etc
First thing to note:
Your xpath is a bit too specific. CSS classes in html don't always have a reliable order: class1 class2 could end up being class2 class1, or even have some broken syntax involved, like trailing spaces ("class1 class2 ").
When you match your xpath directly against [@class="class1 class2"] there's a high chance that it will fail. Instead you should use the contains() function.
Second:
You have a tiny error in your cities_name xpath. In the html body it's a > h2 > text, and in your code it's reversed: h2 > a > text.
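The class-order point can be demonstrated with lxml (Scrapy's selectors share the same XPath engine); the markup below is invented, with the classes reordered and extra whitespace added:

```python
from lxml import etree

# The same element, but with classes reordered and stray whitespace:
html = '<div class=" item col-s-8  col-l-5 ">Adelaide</div>'
tree = etree.fromstring(html, parser=etree.HTMLParser())

# An exact attribute match compares the raw string, so it fails here:
print(len(tree.xpath('//div[@class="col-l-5 col-s-8 item"]')))  # 0

# contains() checks each class independently and still matches:
print(len(tree.xpath('//div[contains(@class,"col-l-5") and contains(@class,"item")]')))  # 1
```

One caveat: contains(@class, "item") is a substring test, so it would also match a class like "items"; for short generic class names, CSS selectors are often the safer choice.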
So that being said I managed to get it working with these css and xpath selectors:
$ parsel "https://www.zomato.com/directory"
> p.mb10>a>h2::text +first
Adelaide
> p.mb10>a>h2::text +len
736
> -xpath
switched to xpath
> //p[contains(#class,"mb10")]/a/h2/text() +first
Adelaide
> //p[contains(#class,"mb10")]/a/h2/text() +len
736
parselcli - https://github.com/Granitosaurus/parsel-cli
You have a wrong XPath:
def parse(self, response):
    for city_node in response.xpath("//h2"):
        l = ItemLoader(item=CountryItem(), selector=city_node)
        l.add_xpath("city_name", ".//a/text()")
        yield l.load_item()
The main reason you are not getting any result from that page is that the html elements of that site are not well-formed. You can get the results using the html5lib parser. I tried different parsers, but the one I just mentioned did the trick. The following is how you can do it. I used a css selector, though.
import scrapy
from bs4 import BeautifulSoup

class ZomatoSpider(scrapy.Spider):
    name = "zomato"
    start_urls = ['https://www.zomato.com/directory']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html5lib')
        for item in soup.select(".row h2 > a"):
            yield {"name": item.text}

Problems with '._ElementUnicodeResult'

While trying to help another user out with some question, I ran into the following problem myself:
The object is to find the country of origin of a list of wines on the page. So we start with:
import requests
from lxml import etree

url = "https://www.winepeople.com.au/wines/Dry-Red/_/N-1z13zte"
res = requests.get(url)
content = res.content
tree = etree.fromstring(content, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)
Next, for reasons I'll get into in a separate question, I'm trying to compare the xpath of two elements with certain attributes. So:
wine = tree.xpath("//div[contains(#class, 'row wine-attributes')]")
country = tree.xpath("//div/text()[contains(., 'Australia')]")
So far, so good. What are we dealing with here?
type(wine),type(country)
>> (list, list)
They are both lists. Let's check the type of the first element in each list:
type(wine[0]),type(country[0])
>> (lxml.etree._Element, lxml.etree._ElementUnicodeResult)
And this is where the problem starts. Because, as mentioned, I need to find the xpath of the first elements of the wine and country lists. And when I run:
tree_struct.getpath(wine[0])
The output is, as expected:
'/html/body/div[13]/div/div/div[2]/div[6]/div[1]/div/div/div[2]/div[2]'
But with the other:
tree_struct.getpath(country[0])
The output is:
TypeError: Argument 'element' has incorrect type (expected
lxml.etree._Element, got lxml.etree._ElementUnicodeResult)
I couldn't find much information about _ElementUnicodeResult, so what is it? And, more importantly, how do I fix the code so that I get an xpath for that node?
You're selecting a text() node instead of an element node. This is why you end up with a lxml.etree._ElementUnicodeResult type instead of a lxml.etree._Element type.
Try changing your xpath to the following in order to select the div element instead of the text() child node of div...
country = tree.xpath("//div[contains(., 'Australia')]")
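As an aside, lxml's text results are "smart strings": an _ElementUnicodeResult remembers the element it came from via getparent(), so you can also recover the containing element (and its path) from the original text() query. A minimal sketch with invented markup standing in for the wine page:

```python
from lxml import etree

# Invented stand-in for the wine page's nested divs:
html = '<html><body><div><div>Australia</div></div></body></html>'
tree = etree.fromstring(html, parser=etree.HTMLParser())
tree_struct = etree.ElementTree(tree)

# The text() query returns smart strings that know their parent element:
country = tree.xpath("//div/text()[contains(., 'Australia')]")
parent = country[0].getparent()     # the <div> holding the text node
print(tree_struct.getpath(parent))  # e.g. /html/body/div/div
```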

Lxml or Xpath content print

I have the following function
def parseTitle(self, post):
    """
    Returns title string with dots replaced by spaces
    """
    return post.xpath('h2')[0].text.replace('.', ' ')
I would like to see the content of post. I have tried everything I can think of.
How can I properly debug the content? This is a website of movies where I'm ripping links and titles, and this function should parse the title.
I am not sure the h2 element even exists; how can I print/debug this?
post is an lxml element tree object, isn't it?
If so, you could first try:
# import lxml.html  # if not yet imported
# (or you can use lxml.etree instead of lxml.html)
print(lxml.html.tostring(post))
If it isn't, you should create an element tree object from it:
post = lxml.html.fromstring(post)
Or maybe the problem is just that you should replace h2 with //h2?
Your question is not very explanatory.
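Putting that together, a small self-contained debug session might look like this (the movie markup is invented for illustration):

```python
import lxml.html

# Invented stand-in for one "post" element from the movie listing page:
post = lxml.html.fromstring('<div><h2>Some.Movie.Title.2020</h2></div>')

# Dump the element's markup to see what you are actually matching:
print(lxml.html.tostring(post).decode())

# Once the <h2> is confirmed to exist, the original logic works:
print(post.xpath('h2')[0].text.replace('.', ' '))  # Some Movie Title 2020
```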
