How to scrape popups on a website with Scrapy and XPath

I want to scrape the name, age and gender from the reviews on boots.com. For age and gender, you can only see this data once you hover the mouse over the name in each review. First of all, I made the code for scraping the name, but it's not working. Second, I don't know how to scrape age and gender from the popup. Could you help me please? Thanks in advance.
Link: https://www.boots.com/clearasil-ultra-rapid-action-treatment-cream-25ml-10084703
Screenshot of the popup
import scrapy
from ..items import BootsItem
from scrapy.loader import ItemLoader
class bootsSpider(scrapy.Spider):
    name = 'boots'
    start_urls = ['https://www.boots.com/clearasil-ultra-rapid-action-treatment-cream-25ml-10084703']
    allowed_domains = ["boots.com"]

    def parse(self, response):
        reviews = response.xpath("//div[@class='bv-content-item-avatar-offset bv-content-item-avatar-offset-off']")
        for review in reviews:
            loader = ItemLoader(item=BootsItem(), selector=review, response=response)
            loader.add_xpath("name", ".//div[@class='bv-content-reference-data bv-content-author-name']/span/text()")
            yield loader.load_item()

JavaScript is used to display the data (78 reviews in your case), so you should use Selenium to scrape this. To display all the comments, you'll have to click multiple times on the following button:
//button[contains(@class,"load-more")]
Then, to scrape the names of all reviewers, you can use the following XPath (and the .text attribute to extract the data):
//li//div[@class="bv-content-header-meta"][./span[@class="bv-content-rating bv-rating-ratio"]]//span[@class="bv-author"]/*/span
Output: 78 nodes
If you want to scrape the review texts, you can use:
//li//div[@class="bv-content-header-meta"][./span[@class="bv-content-rating bv-rating-ratio"]]/following::p[1]
Output: 78 nodes
To get the age and the gender of each reviewer, you'll have to mouse over their name (using the preceding XPath), then fetch the values with the following XPaths:
//span[@class="bv-author-userinfo-value"][preceding-sibling::span[@class="bv-author-userinfo-data"][.="Age"]]
//span[@class="bv-author-userinfo-value"][preceding-sibling::span[@class="bv-author-userinfo-data"][.="Gender"]]
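For reference, here is a minimal Selenium sketch of that flow (clicking the load-more button, hovering over each name, then reading the popup). The waits and the assumption that the button disappears once everything is loaded are mine, so adjust as needed:

import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.boots.com/clearasil-ultra-rapid-action-treatment-cream-25ml-10084703")
time.sleep(5)  # crude wait for the Bazaarvoice widget to render; an explicit wait would be better

# Click "load more" until every review is displayed (assumes the button disappears at the end)
while True:
    buttons = driver.find_elements(By.XPATH, '//button[contains(@class,"load-more")]')
    if not buttons:
        break
    buttons[0].click()
    time.sleep(2)

names = driver.find_elements(
    By.XPATH,
    '//li//div[@class="bv-content-header-meta"]'
    '[./span[@class="bv-content-rating bv-rating-ratio"]]//span[@class="bv-author"]/*/span',
)
for name in names:
    ActionChains(driver).move_to_element(name).perform()  # hover to open the popup
    time.sleep(1)
    age = driver.find_elements(By.XPATH, '//span[@class="bv-author-userinfo-value"]'
                                         '[preceding-sibling::span[@class="bv-author-userinfo-data"][.="Age"]]')
    gender = driver.find_elements(By.XPATH, '//span[@class="bv-author-userinfo-value"]'
                                            '[preceding-sibling::span[@class="bv-author-userinfo-data"][.="Gender"]]')
    print(name.text,
          age[0].text if age else None,
          gender[0].text if gender else None)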
Alternatively, if you don't want to/can't use Selenium, you can download the JSON (see the XHR requests in your browser) which contains everything you need.
https://api.bazaarvoice.com/data/batch.json?passkey=324y3dv5t1xqv8kal1wzrvxig&apiversion=5.5&displaycode=2111-en_gb&resource.q0=reviews&filter.q0=isratingsonly:eq:false&filter.q0=productid:eq:868029&filter.q0=contentlocale:eq:en_EU,en_GB,en_IE,en_US,en_CA&sort.q0=submissiontime:desc&stats.q0=reviews&filteredstats.q0=reviews&include.q0=authors,products,comments&filter_reviews.q0=contentlocale:eq:en_EU,en_GB,en_IE,en_US,en_CA&filter_reviewcomments.q0=contentlocale:eq:en_EU,en_GB,en_IE,en_US,en_CA&filter_comments.q0=contentlocale:eq:en_EU,en_GB,en_IE,en_US,en_CA&limit.q0=100&offset.q0=0&limit_comments.q0=3&callback=bv_1111_50671
For this case, I set &limit.q0= to 100 and offset.q0 to 0 to be sure to fetch all the data. Once you get the JSON, you'll find all the information in Batched Results > q0 > Results > 0, 1, 2, 3, ..., 78.
To download the JSON, use the requests library and extract the data with the json module.
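As a rough sketch with requests (the passkey, product id, and key names below are taken from the URL and path above and may change on the live site; the callback parameter is dropped so the response comes back as plain JSON):

import requests

url = "https://api.bazaarvoice.com/data/batch.json"
params = {
    # Values copied from the batch.json URL above; only the essential parameters are kept
    "passkey": "324y3dv5t1xqv8kal1wzrvxig",
    "apiversion": "5.5",
    "displaycode": "2111-en_gb",
    "resource.q0": "reviews",
    "filter.q0": "productid:eq:868029",
    "stats.q0": "reviews",
    "limit.q0": 100,
    "offset.q0": 0,
}
data = requests.get(url, params=params).json()
for review in data["BatchedResults"]["q0"]["Results"]:
    # Field names below are assumptions based on typical Bazaarvoice payloads;
    # inspect the raw JSON in your browser to confirm the exact keys.
    print(review.get("UserNickname"), review.get("ContextDataValues"))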

Related

XPath to track whether a product is in stock and its price

I am a complete beginner who knows a little bit of HTML and Java so my question might sound very dumb.
I'm basically trying to use Google Spreadsheets in order to track the availability of an item/its price on this website. I'm using the "IMPORTXML" function and have no trouble getting the title of the product or its description. However I cannot get the price as it needs me to select a size first, which I don't know how to do through the "IMPORTXML" function.
Right now, this returns "Imported content is empty.":
=IMPORTXML("https://www.artisan-jp.com/fx-hien-eng.html","//p[@id='price']")
Would creating a function through Google Script work? If so, how do I do it?
Thank you!
You won't be able to fetch any data with IMPORTXML since JavaScript is used to display the price. With the IMPORTFROMWEB add-on you can activate JS rendering, but you'll only get the price of the default product.
It's probably better to use Selenium + Python (or any other language) to achieve your goal. That way you'll be able to click and select a specific product (size, color, hardness).
If you really want to do this with a Google solution, you'll have to write your own custom function in Google Apps Script (send a POST request to a specific URL: https://www.artisan-jp.com/get_syouhin.php). Something like:
function myFunction() {
  var formData = {
    'kuni': 'on',
    'sir': '140',
    'size': '1',
    'color': '1',
  };
  var options = {
    'method': 'post',
    'payload': formData
  };
  Logger.log(UrlFetchApp.fetch('https://www.artisan-jp.com/get_syouhin.php', options).getContentText());
}
In the first part (formData), you declare the parameters of the POST. These parameters correspond to the properties of the product.
Sir:
XSoft = 140
Soft = 141
Mid = 142
Size:
S = 1
M = 2
L = 3
XL = 4
Color:
Red = 1
Black = 5
Output: you'll get the reference number, the description of the product, and its price.
When the product is not in stock, there's a preceding NON in the output.
It's up to you now to extract the data of interest from the output and to populate the cells of your workbook.
Assuming your function is named "mouse", just use SPLIT to display the data properly.
=SPLIT(mouse();"/")
To extract the price only, you can use SPLIT then QUERY. SUBSTITUTE is used to coerce the result to a number.
=SUBSTITUTE(QUERY(SPLIT(mouse();"/");"select Col4");".";",")*1
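If you prefer the Python route suggested earlier over Apps Script, the same POST can be sketched with the requests library (the parameter values come from the tables above, and the slash-separated output format is inferred from the SPLIT formulas, so treat both as assumptions):

import requests

# Example: Mid sheet (sir=142), size L (size=3), black (color=5), per the tables above
payload = {
    "kuni": "on",
    "sir": "142",
    "size": "3",
    "color": "5",
}
response = requests.post("https://www.artisan-jp.com/get_syouhin.php", data=payload)
print(response.text)  # slash-separated string; a leading "NON" means out of stock
fields = response.text.split("/")
price = fields[3] if len(fields) > 3 else None  # the QUERY formula above reads the price from Col4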

How do I use the on_change method to calculate total current value

How do I make a value change in real time after I input a specific field value in a form? For example, from the screenshot below, if I enter Quantity recieved as 10000, the Actual stock should compute to 80500.
So far, this is the code for the onchange method I came up with. I would like to know whether this is the correct approach:
@api.one
@api.onchange('qnty_recieved', 'init_stock')
def _compute_current_stock(self):
    qnty_recieved = self.qnty_recieved
    init_stock = self.init_stock
    current_quantity = self.current_quantity
    self.current_quantity = self.qnty_recieved + self.init_stock
Below is a screenshot of what I am trying to achieve.
If I'm not wrong, you want to change your actual stock in real time based on the quantity received field.
This is best achieved with a computed field using the depends decorator.
@api.one
@api.depends('qnty_recieved')
def _compute_current_stock(self):
    # Assuming current_quantity as the field name of actual stock
    self.current_quantity += self.qnty_recieved
You should also add the compute=_compute_current_stock and store=True keyword arguments to your actual stock field.
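A minimal sketch of how the field declarations could look (the model name is hypothetical, the field names are taken from the snippets above, and the compute loops over the recordset in the newer API style; the formula follows the question, so adjust it if you really want to accumulate):

from odoo import api, fields, models

class StockSheet(models.Model):  # hypothetical model name
    _name = 'stock.sheet'

    init_stock = fields.Float(string='Initial Stock')
    qnty_recieved = fields.Float(string='Quantity Received')
    current_quantity = fields.Float(
        string='Actual Stock',
        compute='_compute_current_stock',
        store=True,
    )

    @api.depends('qnty_recieved', 'init_stock')
    def _compute_current_stock(self):
        for record in self:
            record.current_quantity = record.init_stock + record.qnty_recieved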

Google Sheets IMPORTXML XPath query

I'm trying to get the data for "Next Earnings Announcement" for this http://www.bloomberg.com/quote/1880:HK site.
I have tried
=ImportXml( "http://www.bloomberg.com/quote/1880:HK", "//span[@class='company_stat']" )
=ImportXml( "http://www.bloomberg.com/quote/1880:HK", "/html/body/div[2]/div[1]/div[1]/div[2]/div[2]/div[2]/div[1]/div[1]/div[3]/table/tbody/tr[16]/td/text()" )
I'm getting #N/A; I want 10/27/2014 as the result.
Instead of your fragile, horribly complicated XPath expression, try a useful one:
//th[normalize-space() = 'Next Earnings Announcement']/following-sibling::td
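Plugged into the sheet, that would look like the formula below (assuming the page can still be fetched by IMPORTXML without JavaScript rendering):

=ImportXml("http://www.bloomberg.com/quote/1880:HK", "//th[normalize-space() = 'Next Earnings Announcement']/following-sibling::td")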

django-queryset: query many columns at once

I have a django app (w/postgresql database) that stores information on nest conditions for an endangered bird. Data is collected over multiple sites with different #'s of nests at each site. The nest conditions also have a unique date range per site.
DB Columns: site_name, date, nest_01, nest_02, nest_03 ... all the way to nest_1350.
The nests have values of either empty, 1E, 2E, 3E, or 4E.
Is there a way to do 1 query of all (1-1350) of the nest columns looking for '1E'?
Thanks
Do you actually have a model with 1350+ columns?
If I were you I'd normalize the whole setup like this:
from django.db import models

class Site(models.Model):
    site_name = models.CharField(max_length=100)  # max_length values are illustrative
    date = models.DateField()

class Nest(models.Model):
    name = models.CharField(max_length=100)
    condition = models.CharField(max_length=10)
    site = models.ForeignKey(Site, on_delete=models.CASCADE)
And then query it like this:
site = Site.objects.get(pk=1) # just a Site, I assume you know a Site
nests = Nest.objects.filter(site=site).filter(condition='1E') # your desired nests
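The normalized layout also lets you search across all sites in a single query; a small sketch, assuming the models above:

# All nests currently in condition '1E', regardless of site
hits = Nest.objects.filter(condition='1E').select_related('site')
for nest in hits:
    print(nest.site.site_name, nest.name, nest.condition)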

Facebook API: Getting photos that have a comment containing a string

I'm implementing a pseudo hash-tagging system for the company I work at, using their customers' Facebook photos. A customer can upload a photo to the page, and a page admin can tag it with the hashtag for a product.
What I am trying to do is get all photos from the page that have a certain tag in the comments (for instance, get all photos from the company page with a comment containing only '#bluepants').
I am trying to make sure the Facebook API handles the heavy lifting (we'll cache the results), so I'd like to use FQL or the Graph API, but I can't seem to get it working (my SQL is quite rusty after relying on an ORM for so long). I would prefer it to return as many results as possible, but I'm not sure if FB lets you do more than 25 at once.
This is going to be implemented in a Sinatra site (I am currently playing around with the Koala gem, so bonus points if I can query using it).
Could anyone give me some guidance?
Thanks!
I've got something like this working in FQL/PHP. Here is my multiquery.
{'activity':
"SELECT post_id, created_time FROM stream WHERE source_id = PAGE_ID AND
attachment.fb_object_type = 'photo' AND created_time > 1338834720
AND comments.count > 0 LIMIT 0, 500",
'commented':
"SELECT post_id, text, fromid FROM comment WHERE post_id IN
(SELECT post_id FROM #activity) AND strpos(upper(text), 'HASHTAG') >= 0",
'accepted':
"SELECT post_id, actor_id, message, attachment, place, created_time, likes
FROM stream WHERE post_id IN (SELECT post_id FROM #commented)
ORDER BY likes.count DESC",
'images':
"SELECT pid, src, src_big, src_small, src_width, src_height FROM photo
WHERE pid IN (SELECT attachment.media.photo.pid FROM #accepted)",
'users':
"SELECT name, uid, current_location, locale FROM user WHERE uid IN
(SELECT actor_id FROM #accepted)",
'pages':
"SELECT name, page_id FROM page WHERE page_id IN (SELECT actor_id FROM #accepted)",
'places':
"SELECT name, page_id, description, display_subtext, latitude, longitude
FROM place WHERE page_id IN (SELECT place FROM #accepted)"
}
To break this down:
#activity gets all stream objects created after the start date of my campaign that are photos and have a non-zero comment count. Using a LIMIT of 500 seems to return the maximum number of posts. Higher or lower values return fewer.
#commented finds the posts that have #HASHTAG in the text of one of their comments. Note that the query doesn't search for the # itself, which is a reserved character in FQL; using it may cause you problems.
#accepted gets the full details of the posts found in #commented.
#images gets all the details of the images in those posts. I have it on my todos to refactor this to use object_id instead of pid and try using the new real_width specification to make my layout easier.
#users and #pages get the details of the actor who originally posted the item. I now know I could have used the profile table to get this in one query.
#places gets the location details for geo-tagged posts.
You can see this in action here: http://getwellgabby.org/show-us-a-sign
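For completeness, a minimal Python sketch of how such a multiquery was submitted at the time. FQL and the /fql endpoint have since been removed by Facebook, so this is purely illustrative; ACCESS_TOKEN and PAGE_ID are placeholders:

import json
import requests

ACCESS_TOKEN = "YOUR_PAGE_ACCESS_TOKEN"  # placeholder

# Only the first two named queries are shown; the full multiquery above works the same way.
multiquery = {
    "activity": "SELECT post_id, created_time FROM stream WHERE source_id = PAGE_ID "
                "AND attachment.fb_object_type = 'photo' AND created_time > 1338834720 "
                "AND comments.count > 0 LIMIT 0, 500",
    "commented": "SELECT post_id, text, fromid FROM comment WHERE post_id IN "
                 "(SELECT post_id FROM #activity) AND strpos(upper(text), 'HASHTAG') >= 0",
}

resp = requests.get(
    "https://graph.facebook.com/fql",
    params={"q": json.dumps(multiquery), "access_token": ACCESS_TOKEN},
)
print(resp.json())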
