I keep getting an empty CSV file after running my code. I suspect the XPaths are the problem, but I really don't know what I'm doing; no errors are reported in the terminal output. I'm trying to get info from various Craigslist pages.
from scrapy.spiders import Spider
from scrapy.selector import Selector

from craigslist_probe.items import CraigslistSampleItem


class MySpider(Spider):
    name = "why"
    allowed_domains = ["craigslist.org"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        titles = response.selector.xpath("/section[@id='pagecontainer']")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["img"] = title.xpath("./div[@class='tray']").extract()
            item["body"] = title.xpath("./section[@id='postingbody']/text()").extract()
            item["itemID"] = title.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
            items.append(item)
        return items
I suspect your XPath doesn't correspond to the HTML structure of the page. Note that a single slash (/) means a direct child, so, for example, /section would only match if the root element of the page were a <section> element, which is hardly ever the case. Try using // throughout:
def parse(self, response):
    titles = response.selector.xpath("//section[@id='pagecontainer']")
    items = []
    for title in titles:
        item = CraigslistSampleItem()
        item["img"] = title.xpath(".//div[@class='tray']").extract()
        item["body"] = title.xpath(".//section[@id='postingbody']/text()").extract()
        item["itemID"] = title.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
        items.append(item)
    return items
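If the selectors match and the CSV is still empty, it's also worth double-checking how the output file is produced. Assuming the spider lives in a normal Scrapy project, the built-in feed exporter writes the returned items directly:
scrapy crawl why -o items.csv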
I'm trying to scrape a page with about 20 articles, but for some reason the spider is only finding the information for the very first article. How do I make it scrape every article on the page?
I've tried changing the XPaths multiple times, but I think I'm too new to this to be sure what the issue is. When I take all the paths out of the for loop it scrapes everything fine, but it's not in a format that lets me transfer the data to a CSV file.
import scrapy


class AfgSpider(scrapy.Spider):
    name = 'afg'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@id='taxonomy-page-block']")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
You can use this code to collect the required information:
import scrapy


class AfgSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.pajhwok.com']  # domains only, no path
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        # Select one node per article, not the single page-level container
        container = response.css("div#taxonomy-page-block div.node-article")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
The problem is that your code container = response.xpath("//div[@id='taxonomy-page-block']")
returns only one node: an id should be unique within the whole page, whereas the same class can be shared by several tags.
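To see the difference, compare the one-per-page id selection with the one-per-article class selection. A sketch, where contains() is a common approximation for matching one class among several:
# Matches a single node: ids are unique per page
page_block = response.xpath("//div[@id='taxonomy-page-block']")
# Matches one node per article: classes repeat
articles = response.css("div#taxonomy-page-block div.node-article")
# A roughly equivalent XPath for the class-based selection
articles = response.xpath(
    "//div[@id='taxonomy-page-block']//div[contains(@class, 'node-article')]")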
Nice answer provided by @Roman. Other options to fix your script:
Declare the right XPath for your loop step:
container = response.xpath("//div[@class='node-inner clearfix']")
Or remove the loop step and use the .getall() method to fetch the data:
title = response.xpath("//h2[@class='node-title']/a/text()").getall()
author = response.xpath("//div[@class='field-item even']/a/text()").getall()
rel_url = response.xpath("//h2[@class='node-title']/a/@href").getall()
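Note that .getall() returns three parallel lists rather than one item per article, so to emit CSV rows you would still need to pair them up. A minimal sketch (which assumes every article has all three fields; otherwise the lists can fall out of alignment):
titles = response.xpath("//h2[@class='node-title']/a/text()").getall()
authors = response.xpath("//div[@class='field-item even']/a/text()").getall()
rel_urls = response.xpath("//h2[@class='node-title']/a/@href").getall()
for title, author, rel_url in zip(titles, authors, rel_urls):
    yield {'title': title, 'author': author, 'rel_url': rel_url}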
I'm extracting metadata and URLs from 12 tables on a web page, and while I've got it working, I'm pretty new to both XPath and Scrapy, so is there a more concise way I could have done this?
I was initially getting loads of duplicates as I tried a variety of XPaths, and I realised every table's rows were being repeated for each table. My solution was to enumerate the tables and loop through each one, grabbing only that table's rows. It feels like there is probably a simpler way to do it, but I'm not sure what it is.
import scrapy


class LinkCheckerSpider(scrapy.Spider):
    name = 'foodstandardsagency'
    allowed_domains = ['ratings.food.gov.uk']
    start_urls = ['https://ratings.food.gov.uk/open-data/en-gb/']

    def parse(self, response):
        print(response.url)
        tables = response.xpath('//*[@id="openDataStatic"]//table')
        num_tables = len(tables)
        for tabno in range(num_tables):
            search_path = '//*[@id="openDataStatic"]/table[%d]/tr' % tabno
            rows = response.xpath(search_path)
            for row in rows:
                local_authority = row.xpath('td[1]//text()').extract()
                last_update = row.xpath('td[2]//text()').extract()
                num_businesses = row.xpath('td[3]//text()').extract()
                xml_file_descr = row.xpath('td[4]//text()').extract()
                xml_file = row.xpath('td[4]/a/@href').extract()
                yield {'local_authority': local_authority[1],
                       'last_update': last_update[1],
                       'num_businesses': num_businesses[1],
                       'xml_file': xml_file[0],
                       'xml_file_descr': xml_file_descr[1]
                       }
And I'm running it with
scrapy runspider fsa_xpath.py
You can iterate through the table selectors returned by your first XPath:
tables = response.xpath('//*[@id="openDataStatic"]//table')
for table in tables:
    for row in table.xpath('./tr'):
        local_authority = row.xpath('td[1]//text()').extract()
You did this with the rows.
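Putting it together, the whole callback can drop the enumeration entirely. A sketch that keeps your original field indices and adds a guard for rows that lack the expected cells:
def parse(self, response):
    for table in response.xpath('//*[@id="openDataStatic"]//table'):
        for row in table.xpath('./tr'):
            local_authority = row.xpath('td[1]//text()').extract()
            last_update = row.xpath('td[2]//text()').extract()
            num_businesses = row.xpath('td[3]//text()').extract()
            xml_file_descr = row.xpath('td[4]//text()').extract()
            xml_file = row.xpath('td[4]/a/@href').extract()
            # Skip rows without the expected cells (e.g. header rows)
            if len(local_authority) > 1 and xml_file:
                yield {'local_authority': local_authority[1],
                       'last_update': last_update[1],
                       'num_businesses': num_businesses[1],
                       'xml_file': xml_file[0],
                       'xml_file_descr': xml_file_descr[1]}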
I've spent a lot of time trying to scrape information with Scrapy, without success.
My goal is to surf through the categories and, for each item, scrape the title, price and the title's href link.
The problem seems to come from the parse_items function. I've checked the XPaths with FirePath and I'm able to select the items as wanted, so maybe I just don't understand how XPaths are processed by Scrapy...
Here is my code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

from ..items import electronic_Item


class robot_makerSpider(CrawlSpider):
    name = "robot_makerSpider"
    allowed_domains = ["robot-maker.com"]
    start_urls = [
        "http://www.robot-maker.com/shop/",
    ]

    rules = (
        Rule(
            LinkExtractor(
                allow=(
                    "http://www.robot-maker.com/shop/12-kits-robots",
                    "http://www.robot-maker.com/shop/36-kits-debutants-arduino",
                    "http://www.robot-maker.com/shop/13-cartes-programmables",
                    "http://www.robot-maker.com/shop/14-shields",
                    "http://www.robot-maker.com/shop/15-capteurs",
                    "http://www.robot-maker.com/shop/16-moteurs-et-actionneurs",
                    "http://www.robot-maker.com/shop/17-drivers-d-actionneurs",
                    "http://www.robot-maker.com/shop/18-composants",
                    "http://www.robot-maker.com/shop/20-alimentation",
                    "http://www.robot-maker.com/shop/21-impression-3d",
                    "http://www.robot-maker.com/shop/27-outillage",
                ),
            ),
            callback='parse_items',
        ),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        products = hxs.xpath("//div[@id='center_column']/ul/li")
        items = []
        for product in products:
            item = electronic_Item()
            item['title'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/text()").extract()
            item['price'] = product.xpath(
                "div/div/div[3]/div/div[1]/span[1]/text()").extract()
            item['url'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/@href").extract()
            # check that all fields exist
            if item['title'] and item['price'] and item['url']:
                items.append(item)
        return items
Thanks for your help.
The XPaths in your spider are indeed faulty.
Your first XPath for products does work, but it's not explicit enough and might fail very easily, while the product-detail XPaths are not working at all.
I've got it working with:
products = response.xpath("//div[@class='product-container']")
items = []
for product in products:
    item = dict()
    item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
    item['url'] = product.xpath('.//h2/a/@href').extract_first()
    item['price'] = product.xpath(".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
    items.append(item)
All modern websites have fairly parser-friendly HTML sources (since they need to parse them themselves for their fancy CSS styles and JavaScript functions).
So, in general, you should look at the class and id names of the nodes you want to extract using your browser's inspect tools (right click -> inspect element) rather than some automated selection tool. It's more reliable and doesn't take much more work once you get the hang of it.
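For example, the same fields can be expressed as CSS selectors built directly from the class names the inspector shows. A sketch using the same classes as above:
item['title'] = product.css('h2 a::text').get('').strip()
item['url'] = product.css('h2 a::attr(href)').get()
item['price'] = product.css('span.product-price::text').get('').strip()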
I am trying to extract repeating child elements using XPath.
This is a sample of the XML:
<productDetail>
    <productTypeCode>123</productTypeCode>
    <productPrice currency="EUR">13.27</productPrice>
    <productPrice currency="US">15</productPrice>
</productDetail>
As you can see, the productPrice node repeats.
I am able to pull out one of them by looping through each of the elements:
@node.children.each do |c|
  if c.name == "productDetail"
    info = {}
    productTypeCode = nil
    c.children.each do |gc|
      name = gc.name
      if name == "productTypeCode"
        productTypeCode = gc.text
      elsif name == "productPrice"
        info["productPrice"] = gc.text
        attrs = gc.attributes
        info["productPrice_cur"] = attrs["currency"].value
      end
    end
  end
end
As you can see, I only keep the "productPrice" information once in the loop, but there are two productPrice elements in the XML data.
How do I access both of the values, seeing as the XPath and element names are the same?
I am coding this in Ruby.
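Assuming the snippet above uses Nokogiri (the .attributes / .value calls suggest it), one fix is to collect the repeated productPrice nodes into an array instead of overwriting a single hash key each time. A minimal sketch, where xml_string is assumed to hold the XML above:
require 'nokogiri'

doc = Nokogiri::XML(xml_string)  # xml_string is a hypothetical variable holding the XML
doc.xpath('//productDetail').each do |detail|
  product_type_code = detail.at_xpath('./productTypeCode')&.text
  # xpath returns every matching node, so both currencies are captured
  prices = detail.xpath('./productPrice').map do |price|
    { 'productPrice' => price.text, 'productPrice_cur' => price['currency'] }
  end
  puts({ 'productTypeCode' => product_type_code, 'prices' => prices }.inspect)
end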
I'm trying to create a field "complete_name" that displays a hierarchical name similar to what's done on the product categories grid, but I can't seem to get it to work. It just puts Odoo in an endless loading screen when I access the relevant view using the new field "complete_name".
I have tried to copy the code used in addons/product/product.py and migrate it to the Odoo 9 API by using compute instead of the .function type, but it did not work.
Can someone help me understand what's wrong? Below is my model class, which works fine without the complete_name field in my view.
class cb_public_catalog_category( models.Model ):
    _name = "cb.public.catalog.category"
    _parent_store = True

    parent_left = newFields.Integer( index = True )
    parent_right = newFields.Integer( index = True )
    name = newFields.Char( string = 'Category Name' )
    child_id = newFields.One2many( 'catalog.category', 'parent_id', string = 'Child Categories' )
    complete_name = newFields.Char( compute = '_name_get_fnc', string = 'Name' )

    def _name_get_fnc( self ):
        res = self.name_get( self )
        return dict( res )
Your compute function is supposed to assign a value to the computed attribute of your class, not return a value. Ensure the value you are assigning to complete_name is a string.
Also, name_get() returns a list of (id, name) tuples. I am not sure if you really want a string representation of this list or just the actual name value.
Try this:
def _name_get_fnc( self ):
    # name_get() returns [(id, name)]; take the name of the first record
    self.complete_name = self.name_get()[0][1]
If you really want what is returned by name_get(), then try this:
def _name_get_fnc( self ):
    self.complete_name = str( self.name_get() )
If you are still having issues, I would incorporate some logging to get a better idea of what you are setting the value of complete_name to:
import logging

_logger = logging.getLogger(__name__)

def _name_get_fnc( self ):
    _logger.info("COMPUTING COMPLETE NAME")
    _logger.info("COMPLETE NAME: " + str(self.name_get()))
    self.complete_name = str(self.name_get())
If this does not make the issue apparent, you could always try statically assigning a value, on the off chance that there is a problem with your view:
def _name_get_fnc( self ):
    self.complete_name = "TEST COMPLETE NAME"
After further review I think I have the answer to my own question. It turns out, as with a lot of things, it's very simple.
Simply use "_inherit" to inherit from the product.category model. This gives access to all the functions and fields of product.category, including the complete_name field, and computes the name from my custom model data. I was able to remove my _name_get_fnc and just use the inherited function.
The final model definition is below. Once this update was complete I was able to add a "complete_name" field to my view, and the results were as desired!
class cb_public_catalog_category( models.Model ):
    _name = "cb.public.catalog.category"
    _inherit = 'product.category'
    _parent_store = True

    parent_left = newFields.Integer( index = True )
    parent_right = newFields.Integer( index = True )
    name = newFields.Char( string = 'Category Name' )
    child_id = newFields.One2many( 'catalog.category', 'parent_id', string = 'Child Categories' )