I'm trying to web-scrape a page with about 20 articles, but for some reason the spider is only finding the information for the very first article. How do I make it scrape every article on the page?
I've tried changing the XPaths multiple times, but I think I'm too new to this to be sure what the issue is. When I take all the paths out of the for loop it scrapes everything fine, but the output isn't in a format that lets me transfer the data to a CSV file.
import scrapy


class AfgSpider(scrapy.Spider):
    name = 'afg'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@id='taxonomy-page-block']")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
You can use this code to collect the required information:
import scrapy


class AfgSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.css("div#taxonomy-page-block div.node-article")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url
            }
The problem was this line of your code: container = response.xpath("//div[@id='taxonomy-page-block']")
It returns only one row, because an id must be unique within the whole page, whereas a class can be shared by several tags.
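The one-node-versus-many distinction is easy to see outside Scrapy. Here is a minimal sketch using the standard library's ElementTree on invented, simplified markup (the real page is of course more complex):

```python
import xml.etree.ElementTree as ET

# Invented, simplified stand-in for the page's markup.
html = """
<html><body>
<div id="taxonomy-page-block">
  <div class="node-article"><h2 class="node-title"><a href="/a1">First</a></h2></div>
  <div class="node-article"><h2 class="node-title"><a href="/a2">Second</a></h2></div>
</div>
</body></html>
"""
root = ET.fromstring(html)

# Selecting by the unique id matches the single wrapper, so a for loop
# over the result runs exactly once:
by_id = root.findall(".//div[@id='taxonomy-page-block']")

# Selecting by the repeated article class matches one node per article:
by_class = root.findall(".//div[@class='node-article']")

print(len(by_id), len(by_class))
```

So looping over the id-based selection yields one iteration and one scraped item, while looping over the class-based selection yields one iteration per article.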
Nice answer provided by @Roman. Other options to fix your script:

- Declare the right XPath for your loop step:

container = response.xpath("//div[@class='node-inner clearfix']")

- Or remove the loop step and use the .getall() method to fetch the data:

title = response.xpath(".//h2[@class='node-title']/a/text()").getall()
author = response.xpath(".//div[@class='field-item even']/a/text()").getall()
rel_url = response.xpath(".//h2[@class='node-title']/a/@href").getall()
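With the .getall() route the fields come back as parallel lists, which you then have to stitch back into per-article records yourself. A rough illustration of that stitching with the stdlib's ElementTree on invented markup:

```python
import xml.etree.ElementTree as ET

# Invented markup with two articles.
html = """
<html><body>
<h2 class="node-title"><a href="/a1">First</a></h2>
<h2 class="node-title"><a href="/a2">Second</a></h2>
</body></html>
"""
root = ET.fromstring(html)

# Equivalent of .getall(): every match in document order, as flat lists.
links = root.findall(".//h2[@class='node-title']/a")
titles = [a.text for a in links]
rel_urls = [a.get('href') for a in links]

# Zip the parallel lists back into one record per article.
rows = [{'title': t, 'rel_url': u} for t, u in zip(titles, rel_urls)]
```

Note this only lines up correctly when every article has all the fields; a single missing author shifts the lists out of sync, which is why the per-article loop is usually the safer choice.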
I've spent a lot of time trying to scrape information with Scrapy, without success.
My goal is to crawl through the categories and, for each item, scrape the title, price and the title's href link.
The problem seems to come from the parse_items function. I've checked the XPaths with FirePath and I'm able to select the items as wanted, so maybe I just don't understand how XPaths are processed by Scrapy...
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

from ..items import electronic_Item


class robot_makerSpider(CrawlSpider):
    name = "robot_makerSpider"
    allowed_domains = ["robot-maker.com"]
    start_urls = [
        "http://www.robot-maker.com/shop/",
    ]

    rules = (
        Rule(LinkExtractor(
                allow=(
                    "http://www.robot-maker.com/shop/12-kits-robots",
                    "http://www.robot-maker.com/shop/36-kits-debutants-arduino",
                    "http://www.robot-maker.com/shop/13-cartes-programmables",
                    "http://www.robot-maker.com/shop/14-shields",
                    "http://www.robot-maker.com/shop/15-capteurs",
                    "http://www.robot-maker.com/shop/16-moteurs-et-actionneurs",
                    "http://www.robot-maker.com/shop/17-drivers-d-actionneurs",
                    "http://www.robot-maker.com/shop/18-composants",
                    "http://www.robot-maker.com/shop/20-alimentation",
                    "http://www.robot-maker.com/shop/21-impression-3d",
                    "http://www.robot-maker.com/shop/27-outillage",
                ),
            ),
            callback='parse_items',
        ),
    )

    def parse_items(self, response):
        hxs = Selector(response)
        products = hxs.xpath("//div[@id='center_column']/ul/li")
        items = []
        for product in products:
            item = electronic_Item()
            item['title'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/text()").extract()
            item['price'] = product.xpath(
                "div/div/div[3]/div/div[1]/span[1]/text()").extract()
            item['url'] = product.xpath(
                "li[1]/div/div/div[2]/h2/a/@href").extract()
            # check that all fields exist
            if item['title'] and item['price'] and item['url']:
                items.append(item)
        return items
Thanks for your help.
The XPaths in your spider are indeed faulty.
Your first XPath, for products, does work, but it's not explicit enough and could fail very easily, while the product-detail XPaths don't work at all.
I've got it working with:

products = response.xpath("//div[@class='product-container']")
items = []
for product in products:
    item = dict()
    item['title'] = product.xpath('.//h2/a/text()').extract_first('').strip()
    item['url'] = product.xpath('.//h2/a/@href').extract_first()
    item['price'] = product.xpath(".//span[contains(@class,'product-price')]/text()").extract_first('').strip()
    items.append(item)
All modern websites have parsing-friendly HTML sources, since they need to parse them themselves for their CSS styles and JavaScript functions.
So in general you should look at the class and id names of the nodes you want to extract, using the browser's inspect tools (right click -> Inspect Element), rather than relying on some automated selection tool. It's more reliable and doesn't take much more work once you get the hang of it.
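The difference in robustness can be sketched with the stdlib's ElementTree on some invented product markup (ElementTree only supports exact attribute matches, so the class check here is simpler than the contains() version above):

```python
import xml.etree.ElementTree as ET

# Invented markup, loosely modeled on a shop listing page.
html = """
<html><body><div id="center_column">
  <div class="product-container">
    <h2><a href="/p1">Servo kit</a></h2>
    <span class="product-price">19,90</span>
  </div>
</div></body></html>
"""
root = ET.fromstring(html)

# A positional path hard-codes the exact nesting, and silently matches
# nothing as soon as a wrapper div is added or removed:
positional = root.findall(".//div/div/div[3]/div/span")

# Anchoring on a meaningful class name survives layout changes:
prices = [s.text for s in root.findall(".//span[@class='product-price']")]
```

The positional query comes back empty against this markup, while the attribute-based one still finds the price, which is exactly the failure mode of the original spider's long div/div/div paths.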
I'm trying to create a field "complete_name" that displays a hierarchy name similar to what's done on the product categories grid, but I can't seem to get it to work. It just puts Odoo in an endless loading screen when I access the relevant view using the new field "complete_name".
I tried to copy the code used in addons/product/product.py and migrate it to the Odoo 9 API by using compute instead of the .function type, but it did not work.
Can someone help me understand what's wrong? Below is my model class, which works fine without the complete_name field in my view.
class cb_public_catalog_category( models.Model ):
    _name = "cb.public.catalog.category"
    _parent_store = True

    parent_left = newFields.Integer( index = True )
    parent_right = newFields.Integer( index = True )
    name = newFields.Char( string = 'Category Name' )
    child_id = newFields.One2many( 'catalog.category', 'parent_id', string = 'Child Categories' )
    complete_name = newFields.Char( compute = '_name_get_fnc', string = 'Name' )

    def _name_get_fnc( self ):
        res = self.name_get( self )
        return dict( res )
Your compute function is supposed to assign a value to an attribute of your class, not return a value. Ensure the value you assign to complete_name is a string.
Also, name_get() returns a list of (id, name) tuples. I am not sure if you really want a string representation of that whole result or just the actual name value.
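For reference, a plain-Python sketch of the shape name_get() returns (the values below are invented for illustration):

```python
# name_get() returns a list of (id, display_name) tuples, one per record
# in the recordset; the values here are invented for illustration.
res = [(7, 'All / Saleable / Kits')]

# The display name of a single record is therefore the second element
# of the first tuple:
record_id, display_name = res[0]
```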
Try this:

def _name_get_fnc( self ):
    # name_get() returns [(id, name)], so take the name from the first tuple
    self.complete_name = self.name_get()[0][1]
If you really want what is returned by name_get(), then try this:

def _name_get_fnc( self ):
    self.complete_name = str( self.name_get() )
If you are still having issues I would incorporate some logging to get a better idea of what you are setting the value of complete_name to.
import logging
_logger = logging.getLogger( __name__ )

def _name_get_fnc( self ):
    _logger.info( "COMPUTING COMPLETE NAME" )
    _logger.info( "COMPLETE NAME: " + str( self.name_get() ) )
    self.complete_name = str( self.name_get() )
If this does not make the issue apparent, you could always try statically assigning a value, on the off chance that there is a problem with your view.

def _name_get_fnc( self ):
    self.complete_name = "TEST COMPLETE NAME"
After further review I think I have the answer to my own question. It turns out, as with a lot of things, it's very simple.
Simply use "_inherit" to inherit the product.category model. This gives access to all the functions and fields of product.category, including the complete_name field, and computes the name from my custom model data. I was able to remove my _name_get_fnc and just use the inherited function.
The final model definition is below. Once this update was complete I was able to add a "complete_name" field to my view and the results were as desired!
class cb_public_catalog_category( models.Model ):
    _name = "cb.public.catalog.category"
    _inherit = 'product.category'
    _parent_store = True

    parent_left = newFields.Integer( index = True )
    parent_right = newFields.Integer( index = True )
    name = newFields.Char( string = 'Category Name' )
    child_id = newFields.One2many( 'catalog.category', 'parent_id', string = 'Child Categories' )
I keep getting an empty CSV file after running my code. I suspect it might be the XPaths, but I really don't know what I'm doing, and there aren't any errors reported in the terminal output. I'm trying to get info from various Craigslist pages.
from scrapy.spiders import Spider
from scrapy.selector import Selector

from craigslist_probe.items import CraigslistSampleItem


class MySpider(Spider):
    name = "why"
    allowed_domains = ["craigslist.org"]

    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        titles = response.selector.xpath("/section[@id='pagecontainer']")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["img"] = title.xpath("./div[@class='tray']").extract()
            item["body"] = title.xpath("./section[@id='postingbody']/text()").extract()
            item["itemID"] = title.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
            items.append(item)
        return items
I suspect your XPath doesn't correspond to the HTML structure of the page. Notice that a single slash (/) implies a direct child, so, for example, /section would only work if the root element of the page were a <section> element, which is hardly ever the case. Try using // throughout:
def parse(self, response):
    titles = response.selector.xpath("//section[@id='pagecontainer']")
    items = []
    for title in titles:
        item = CraigslistSampleItem()
        item["img"] = title.xpath(".//div[@class='tray']").extract()
        item["body"] = title.xpath(".//section[@id='postingbody']/text()").extract()
        item["itemID"] = title.xpath(".//div[@class='postinginfos']/p[@class='postinginfo']").extract()
        items.append(item)
    return items
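The direct-child versus any-depth distinction can be reproduced with the stdlib's ElementTree, where a plain path step is a direct-child search and .// is the analogue of XPath's //:

```python
import xml.etree.ElementTree as ET

html = "<html><body><section id='pagecontainer'><p>hi</p></section></body></html>"
root = ET.fromstring(html)

# A direct-child search from the root (<html>) misses <section>,
# which is nested one level deeper inside <body>:
direct = root.findall("section")

# A descendant search finds it at any depth:
anywhere = root.findall(".//section[@id='pagecontainer']")

print(len(direct), len(anywhere))
```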
I'm trying to create a web app in Django 1.9 for task tracking and ordering. The different tasks are divided into spaces (like different projects). Now, I want to be able to choose who the task is assigned to in the CreateView.
The problem is that I have a large number of users in my system, so I do not want to show a dropdown. Instead, I want to use a TextInput widget and have the form check against the available options (this way I can also use typeahead on the client side).
This is the best I could come up with for the TaskCreate view:
class TaskCreate(LoginRequiredMixin, CreateView):
    """
    a view for creating new tasks
    """
    model = Task
    fields = ['space', 'name', 'description', 'assigned_to', 'due_date']
    template_name = "task_tracker/task_form.html"
    success_url = reverse_lazy('tracker:my_open_task_list')

    def get_context_data(self, **kwargs):
        context = super(TaskCreate, self).get_context_data(**kwargs)
        context['header_caption'] = 'Create'
        context['submit_caption'] = 'Create'
        context['all_usernames'] = [x.username for x in User.objects.all()]
        return context

    def get_form(self, form_class=None):
        form = super(TaskCreate, self).get_form(form_class)
        form.fields['assigned_to'].choices = [(x.username, x.id) for x in User.objects.all()]
        form.fields['assigned_to'].initial = self.request.user.username,
        form.fields['assigned_to'].widget = widgets.TextInput()
        try:
            form.fields['space'].initial = Space.objects.get(name=self.request.GET['space'])
        finally:
            return form

    def form_valid(self, form):
        form.instance.created_by = self.request.user
        form.instance.assigned_to = User.objects.get(username=form.cleaned_data['assigned_to'])
        return super(TaskCreate, self).form_valid(form)
But this is not working: the form still considers my choice illegal, even when I type in a valid username.
I tried swapping x.username and x.id in the choices, but that didn't help either.
I've been stuck on this for a week now. Can anybody help me, please?
I am creating a set of things (each thing has an FK to the set) directly with forms. The problem I am having is with the view(s).
I want to create the set for the things and then update all the things over and over using AJAX (kind of like autosave). In my case the set is a SurveySet and the thing is a Survey.
def screen_many(request):
    if not request.is_ajax():
        # get an ordered QuerySet of students
        students = ids_to_students(request.GET.items())
        e_students = ids_to_students(request.GET.items(), 'e')
        i_students = ids_to_students(request.GET.items(), 'i')
        survey_count = len(students)
        # Build a dataset of students with their associated behavior types.
        data = [{'student': s.pk, 'behavior_type': 'E'} for s in e_students]
        data += [{'student': s.pk, 'behavior_type': 'I'} for s in i_students]
        # Use that dataset as initial data for a formset
        SurveyFormset = formset_factory(SurveyForm, extra=0)
        survey_formset = SurveyFormset(initial=data)
        # ... not shown: customizing the crispy form helper
        # Make a new survey set...
        ss = SurveySet()
        ss.user = request.user
        ss.save()
    if request.is_ajax():
        surveyset = get_object_or_404(SurveySet, pk=ss.pk)
        surveys = surveyset.survey_set.all()
        survey_formset = SurveyFormset(request.POST, instance=surveyset)
        if survey_formset.is_valid():
            # Create surveys for this surveyset
            for form in survey_formset.forms:
                saved = form.save(commit=False)
                saved.surveyset = ss
                saved.save()
            HttpResponse('saved.')
    formsetWithStudents = zip(survey_formset.forms, students)
    c = {
        'formsetWithStudents': formsetWithStudents,
        'students': students,
        'survey_count': survey_count,
        'e_students': e_students,
        'i_students': i_students,
        'survey_formset': survey_formset,
    }
    c.update(csrf(request))
    return render_to_response("reports/screen_many.html", c)
If my URL looks like this: http://127.0.0.1:8000/screen_many/?e_1=13&e_2=12&i_1=14, the view makes 3 survey sets, all the while complaining that there is an

UnboundLocalError at /screen_many/
local variable 'ss' referenced before assignment

I feel like I need to make a separate view just for the AJAX part, and I want the SurveySet object to be created only once.
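That traceback is exactly the two-branch pattern in the view: ss is assigned only on the non-AJAX branch, and the AJAX request is a separate function call that never executes that branch. A stripped-down reproduction:

```python
def handle(is_ajax):
    if not is_ajax:
        ss = "survey set created here"   # only assigned on the non-AJAX path
    if is_ajax:
        return ss                        # this path never assigned ss

raised = False
try:
    handle(is_ajax=True)
except UnboundLocalError:
    # Python sees 'ss' assigned somewhere in the function, so it treats it
    # as a local variable, and on this path it has no value yet.
    raised = True
```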
So, in other words: I am filling in the forms of a formset, which updates after clicking "view next form". This is in my template:
$('.next').click(function(){
    $(this).parent().hide();
    $(this).parent().next().show();
    var posting = $.post('/screen_many/', $('form').serializeArray());
    posting.done(function(response){
        console.log(response);
    });
});
Or I could send the POST data here:
def save_as_you_go(request):
    if request.is_ajax():
        # Get the surveyset from POST
        ss = request.POST['form-0-surveyset']
        surveyset = get_object_or_404(SurveySet, pk=ss)
        surveys = surveyset.survey_set.all()
        SurveyFormSet = inlineformset_factory(SurveySet, Survey, form=SurveyForm, can_delete=False, extra=0)
        survey_formset = SurveyFormSet(request.POST, instance=surveyset)
        if survey_formset.is_valid():
            for form in survey_formset.forms:
                student = form.save(commit=False)
                student.surveyset = surveyset
                student.save()
            return HttpResponse('saved.')
    else:
        return HttpResponseRedirect('/')
But then I get:

[u'ManagementForm data is missing or has been tampered with']
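That error usually means the POST reached the view without the hidden management-form fields that rendering {{ survey_formset.management_form }} inside the <form> tag produces. A sketch of the keys a valid formset POST has to contain, assuming Django's default "form" prefix; the per-form field values are invented for illustration:

```python
# The management form tells Django how many forms to parse out of the POST;
# without these keys it raises "ManagementForm data is missing or has been
# tampered with". Field names assume the default "form" prefix.
post_data = {
    'form-TOTAL_FORMS': '2',      # number of forms submitted
    'form-INITIAL_FORMS': '0',    # number of pre-existing forms
    'form-MIN_NUM_FORMS': '0',
    'form-MAX_NUM_FORMS': '1000',
    'form-0-student': '13',       # per-form fields, invented for illustration
    'form-0-behavior_type': 'E',
    'form-1-student': '12',
    'form-1-behavior_type': 'E',
}
```

With $('form').serializeArray(), these hidden fields are only picked up if the template actually renders the management form inside the submitted <form> element.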
Forgive me if my answer seems naive (I am new to Python and Django), but it looks like you are setting the ss variable in the non-AJAX branch and then referencing it in the AJAX branch. Perhaps you can set ss before the if statements?
# set ss variable before the if statements
ss = SurveySet()
ss.user = request.user
ss.save()

if not request.is_ajax():
    # do your non-AJAX request stuff
    ...

if request.is_ajax():
    # do your AJAX request stuff
    ...