Scrapy unable to scrape items, xpath not working

I've spent a lot of time trying to scrape information with Scrapy, without success.
My goal is to crawl through the categories and, for each item, scrape the title, price, and the title's href link.
The problem seems to come from the parse_items function. I've checked the xpaths with FirePath and I'm able to select the items as intended, so maybe I just don't understand how xpaths are processed by Scrapy...
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

from ..items import electronic_Item


class robot_makerSpider(CrawlSpider):
    name = "robot_makerSpider"
    allowed_domains = ["robot-maker.com"]
    start_urls = [
        "http://www.robot-maker.com/shop/",
    ]

    rules = (
        Rule(
            LinkExtractor(
                allow=(
                    "http://www.robot-maker.com/shop/12-kits-robots",
                    "http://www.robot-maker.com/shop/36-kits-debutants-arduino",
                    "http://www.robot-maker.com/shop/13-cartes-programmables",
                    "http://www.robot-maker.com/shop/14-shields",
                    "http://www.robot-maker.com/shop/15-capteurs",
                    "http://www.robot-maker.com/shop/16-moteurs-et-actionneurs",
                    "http://www.robot-maker.com/shop/17-drivers-d-actionneurs",
                    "http://www.robot-maker.com/shop/18-composants",
                    "http://www.robot-maker.com/shop/20-alimentation",
                    "http://www.robot-maker.com/shop/21-impression-3d",
                    "http://www.robot-maker.com/shop/27-outillage",
                ),
            ),
            callback='parse_items',
        ),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        products = hxs.xpath("//div[@id='center_column']/ul/li")
        items = []
        for product in products:
            item = electronic_Item()
            item['title'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/text()").extract()
            item['price'] = product.xpath(
                "div/div/div[3]/div/div[1]/span[1]/text()").extract()
            item['url'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/@href").extract()

            # check that all fields exist
            if item['title'] and item['price'] and item['url']:
                items.append(item)
        return items
Thanks for your help.

The xpaths in your spider are indeed faulty.
Your first xpath for products does work, but it's not explicit enough and might fail really easily, while the product detail xpaths are not working at all.
I've got it working with:
products = response.xpath("//div[@class='product-container']")
items = []
for product in products:
    item = dict()
    item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
    item['url'] = product.xpath('.//h2/a/@href').extract_first()
    item['price'] = product.xpath(".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
Most modern websites have quite parsing-friendly HTML sources (since they need to target nodes themselves for their fancy CSS styles and JavaScript functions).
So generally you should look at the class and id names of the nodes you want to extract using your browser's inspect tools (right click -> Inspect Element) instead of using some automated selection tool. It's more reliable and doesn't take much more work once you get the hang of it.
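To tie it back to the question, here is a minimal sketch of the whole callback built on those selectors. It assumes the electronic_Item class from the question and is untested against the live site; yielding items one by one is the idiomatic alternative to collecting them in a list:

def parse_items(self, response):
    # Each product card carries the 'product-container' class.
    for product in response.xpath("//div[@class='product-container']"):
        item = electronic_Item()
        item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
        item['url'] = product.xpath('.//h2/a/@href').extract_first()
        item['price'] = product.xpath(
            ".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
        # Keep only complete items, as in the original spider.
        if item['title'] and item['price'] and item['url']:
            yield item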

Related

For loop not scraping all items, just one

I'm trying to web-scrape a page with about 20 articles, but for some reason the spider is only finding the information for the very first article. How do I make it scrape every article on the page?
I've tried changing the xpaths multiple times, but I think I'm too new to this to be sure what the issue is. When I take all the paths out of the for loop it scrapes everything fine, but it's not in a format that allows me to transfer the data to a CSV file.
import scrapy


class AfgSpider(scrapy.Spider):
    name = 'afg'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@id='taxonomy-page-block']")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
You can use this code to collect the required information:
import scrapy


class AfgSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.css("div#taxonomy-page-block div.node-article")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
The problem was that your code container = response.xpath("//div[@id='taxonomy-page-block']") returns only one selector: an id should be unique within the whole page, while the same class can be shared by several tags, so your loop body only ran once.
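To see why this matters, here is a small self-contained demonstration with hypothetical markup (the real page's structure will differ):

from scrapy.selector import Selector

# One unique id wrapping several repeated article classes.
html = """
<div id="taxonomy-page-block">
  <div class="node-article"><h2 class="node-title"><a href="/a1">First</a></h2></div>
  <div class="node-article"><h2 class="node-title"><a href="/a2">Second</a></h2></div>
</div>
"""
sel = Selector(text=html)
print(len(sel.xpath("//div[@id='taxonomy-page-block']")))        # 1 -> loop runs once
print(len(sel.css("div#taxonomy-page-block div.node-article")))  # 2 -> one per article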
Nice answer provided by @Roman. Two other options to fix your script:
Declare the right XPath for your loop step:
container = response.xpath("//div[@class='node-inner clearfix']")
Or remove your loop step and use the .getall() method to fetch the data:
title = response.xpath(".//h2[@class='node-title']/a/text()").getall()
author = response.xpath(".//div[@class='field-item even']/a/text()").getall()
rel_url = response.xpath(".//h2[@class='node-title']/a/@href").getall()
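One caveat with the .getall() approach: it produces three parallel lists, so you have to zip them back together yourself to rebuild per-article records. A hedged sketch, assuming every article contributes exactly one entry to each list:

# title, author and rel_url are the lists built above with .getall();
# if any article lacks an author, the lists drift out of alignment.
for t, a, u in zip(title, author, rel_url):
    yield {'title': t, 'author': a, 'rel_url': u}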

Extracting data from multiple tables with Scrapy using xpath

I'm extracting metadata and URLs from 12 tables on a web page, and while I've got it working, I'm pretty new to both XPath and Scrapy, so is there a more concise way I could have done this?
I was initially getting loads of duplicates as I tried a variety of xpaths, and realised each table row was being repeated for each table. My solution was to enumerate the tables and loop through each one, grabbing the rows only for that table. It feels like there is probably a simpler way to do it, but I'm not sure.
import scrapy


class LinkCheckerSpider(scrapy.Spider):
    name = 'foodstandardsagency'
    allowed_domains = ['ratings.food.gov.uk']
    start_urls = ['https://ratings.food.gov.uk/open-data/en-gb/']

    def parse(self, response):
        print(response.url)
        tables = response.xpath('//*[@id="openDataStatic"]//table')
        num_tables = len(tables)
        for tabno in range(1, num_tables + 1):  # XPath positions are 1-based
            search_path = '//*[@id="openDataStatic"]/table[%d]/tr' % tabno
            rows = response.xpath(search_path)
            for row in rows:
                local_authority = row.xpath('td[1]//text()').extract()
                last_update = row.xpath('td[2]//text()').extract()
                num_businesses = row.xpath('td[3]//text()').extract()
                xml_file_descr = row.xpath('td[4]//text()').extract()
                xml_file = row.xpath('td[4]/a/@href').extract()
                yield {'local_authority': local_authority[1],
                       'last_update': last_update[1],
                       'num_businesses': num_businesses[1],
                       'xml_file': xml_file[0],
                       'xml_file_descr': xml_file_descr[1]
                       }
And I'm running it with
scrapy runspider fsa_xpath.py
You can iterate through the table selectors returned by your first xpath:
tables = response.xpath('//*[@id="openDataStatic"]//table')
for table in tables:
    for row in table.xpath('./tr'):
        local_authority = row.xpath('td[1]//text()').extract()
You did this with the rows.
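Putting it together, a minimal sketch of the whole callback using nested iteration (same field handling as the question; untested against the live page):

def parse(self, response):
    # Iterating the table selectors keeps './tr' scoped to its own
    # table, which is what avoids the duplicate rows.
    for table in response.xpath('//*[@id="openDataStatic"]//table'):
        for row in table.xpath('./tr'):
            local_authority = row.xpath('td[1]//text()').extract()
            last_update = row.xpath('td[2]//text()').extract()
            num_businesses = row.xpath('td[3]//text()').extract()
            xml_file_descr = row.xpath('td[4]//text()').extract()
            xml_file = row.xpath('td[4]/a/@href').extract()
            # Guard against header/blank rows before indexing, since the
            # question's [1]/[0] offsets assume fully populated cells.
            if xml_file and len(local_authority) > 1:
                yield {'local_authority': local_authority[1],
                       'last_update': last_update[1],
                       'num_businesses': num_businesses[1],
                       'xml_file': xml_file[0],
                       'xml_file_descr': xml_file_descr[1]}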

How to write this domain to be 100% sure that we get the right stock pack operation for each invoice line?

This post is a little more complex than usual.
We have created a new field on account.invoice.line: pack_operation. With this field, we can print the serial/lot number for each line on the PDF invoice (this part works well).
We have spent many hours trying to write the domain that selects the EXACT and ONLY stock pack operation for each invoice line.
In the code below, we used the domain [('id', '=', 31)] for our tests when printing the PDF.
How can we write this domain to be 100% sure that we get the right stock pack operation for each invoice line?
I really need your help here... Too complex for my brain.
Our code:
class AccountInvoiceLine(models.Model):
    _inherit = "account.invoice.line"

    pack_operation = fields.Many2one(comodel_name='stock.pack.operation', compute='compute_stock_pack_operation_id')

    def compute_stock_pack_operation_id(self):
        stock_operation_obj = self.env['stock.pack.operation']
        stock_operation = stock_operation_obj.search([('id', '=', 31)])
        self.pack_operation = stock_operation[0]
EDIT #1
I know that you won't like my code, but this one seems to work. I welcome any comments and improvements.
class AccountInvoiceLine(models.Model):
    _inherit = "account.invoice.line"

    pack_operation = fields.Many2one(comodel_name='stock.pack.operation', compute='compute_stock_pack_operation_id')

    @api.one
    def compute_stock_pack_operation_id(self):
        procurement_order_obj = self.env['procurement.order']
        stock_operation_obj = self.env['stock.pack.operation']

        all_picking_ids_for_this_invoice_line = []
        for saleorderline in self.sale_line_ids:
            for procurement in saleorderline.procurement_ids:
                for stockmove in procurement.move_ids:
                    if stockmove.picking_id.id not in all_picking_ids_for_this_invoice_line:
                        all_picking_ids_for_this_invoice_line.append(stockmove.picking_id.id)

        stock_operation = stock_operation_obj.search(
            ['&',
             ('picking_id', 'in', all_picking_ids_for_this_invoice_line),
             ('product_id', '=', self.product_id.id)
             ]
        )
        self.pack_operation = stock_operation[0]
The pack_operation field is a computed field, which by default means that the field will not be saved in the database unless you set store=True when you define your field.
So, what you can do here is change:
pack_operation = fields.Many2one(comodel_name='stock.pack.operation', compute='compute_stock_pack_operation_id')
to:
pack_operation = fields.Many2one(comodel_name='stock.pack.operation', compute='compute_stock_pack_operation_id', store=True)
And try running your query again.
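As a side note (hedged, since details vary by Odoo version): once the field is stored, it is only recomputed when its declared dependencies change, so the compute method would typically also carry an @api.depends decorator, along the lines of:

# Hypothetical dependency path; adjust it to the fields the
# compute method actually reads.
@api.depends('sale_line_ids.procurement_ids.move_ids')
def compute_stock_pack_operation_id(self):
    ...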

Empty CSV after Scrapy Crawl

I keep getting an empty CSV file after running my code. I suspect it might be the XPaths, but I really don't know what I'm doing. There aren't any errors reported in the terminal output. I'm trying to get info from various Craigslist pages.
from scrapy.spiders import Spider
from scrapy.selector import Selector
from craigslist_probe.items import CraigslistSampleItem


class MySpider(Spider):
    name = "why"
    allowed_domains = ["craigslist.org"]
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        titles = response.selector.xpath("/section[@id='pagecontainer']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["img"] = titles.xpath("./div[@class='tray']").extract()
            item["body"] = titles.xpath("./section[@id='postingbody']/text()").extract()
            item["itemID"] = titles.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
            items.append(item)
        return items
I suspect your XPath doesn't correspond to the HTML structure of the page. Notice that a single slash (/) implies a direct child, so, for example, /section would only work if the root element of the page were a <section> element, which is hardly ever the case. Try using // throughout:
def parse(self, response):
    titles = response.selector.xpath("//section[@id='pagecontainer']")
    items = []
    for titles in titles:
        item = CraigslistSampleItem()
        item["img"] = titles.xpath(".//div[@class='tray']").extract()
        item["body"] = titles.xpath(".//section[@id='postingbody']/text()").extract()
        item["itemID"] = titles.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
        items.append(item)
    return items  # without returning the items, the exported CSV stays empty
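Once the parse method actually returns items, the CSV itself comes from Scrapy's feed export. As a usage note (the file name here is hypothetical):
scrapy runspider craigslist_spider.py -o items.csv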

Django Forms, ModelChoiceField using RadioSelect widget, Grouped by FK

The challenge: output a radio select as nested <ul></ul>, grouped by the task's FK.
i.e.
class Category(models.Model):
    # ...

class Task(models.Model):
    # ...
    category = models.ForeignKey(Category)
    # ...

forms.py

class ActivityForm(forms.ModelForm):
    # ...
    task = forms.ModelChoiceField(
        queryset=Task.objects.all(),
        widget=RadioSelectGroupedByFK
    )
widgets.py
class RadioFieldRendererGroupedByFK(RadioFieldRenderer):
    """
    An object used by RadioSelect to enable customization of radio widgets.
    """
    #def __init__(self, attrs=None):
        # Need a radio select for each?? Just an idea.
        #widgets = (RadioSelect(attrs=attrs), RadioSelect(attrs=attrs))
        #super(RadioFieldRendererGroupedByFK, self).__init__(widgets, attrs)

    def render(self):
        """Outputs nested <ul> for this set of radio fields."""
        return mark_safe(
            #### Somehow the crux of the work happens here? But how to get the
            #### right context??
            u'<ul>\n%s\n</ul>' % u'\n'.join(
                [u'<li>%s</li>' % force_unicode(w) for w in self]
            )
        )

class RadioSelectGroupedByFK(forms.RadioSelect):
    renderer = RadioFieldRendererGroupedByFK
Many thanks!!
itertools.groupby() is perfect for this. I'll assume that Task and Category each have a name attribute you want them sorted by. I don't know the Django APIs, so you'll probably want to move sorting into the db query, and you'll need to look at the widget's attributes to figure out how to access the task objects, but here's the basic formula:
from itertools import groupby

# sort first since groupby only groups adjacent items
tasks.sort(key=lambda task: (task.category.name, task.name))

for category, category_tasks in groupby(tasks, key=lambda task: task.category):
    print '%s:' % category.name
    for task in category_tasks:
        print '* %s' % task.name
This should print a list like:
Breakfast foods:
* eggs
* spam
Dinner foods:
* spam
* spam
* spam
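Following the suggestion above to move the sorting into the database query, a hedged sketch of how that could look with the Django ORM (assuming the name fields exist as described):

from itertools import groupby

# Let the database sort; select_related avoids an extra query
# per task when reading task.category in the groupby key.
tasks = Task.objects.select_related('category').order_by('category__name', 'name')
for category, category_tasks in groupby(tasks, key=lambda task: task.category):
    ...  # render one <ul> per category here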
HTH
