Scrapy: '//select/option' xpath not yielding any results - xpath

I've been trying Scrapy and absolutely love it. However, one of the things I'm testing it on does not seem to work.
I'm trying to scrape a page (apple.com, for example) and save a list of the keyboard options available, using the simple XPath
//select/option
In the Chrome console, the page below comes back with an array of selections that I can easily iterate through; however, if I use response.xpath('//select/option') in the scraper, or in the Scrapy shell, I get nothing back from it.
My code for the scraper looks a bit like the below (edited for simplicity)
import scrapy
from scrapy.linkextractors import LinkExtractor
from lxml import html
from apple.items import AppleItem

class ApplekbSpider(scrapy.Spider):
    name = 'applekb'
    allowed_domains = ['apple.com']
    start_urls = ('http://www.apple.com/ae/shop/buy-mac/imac?product=MK482&step=config#', )

    def parse(self, response):
        for sel in response.xpath('//select/option'):
            item = AppleItem()
            item['country'] = sel.xpath('//span[@class="as-globalfooter-locale-name"]/text()').extract()
            item['kb'] = sel.xpath('text()').extract()
            item['code'] = sel.xpath('@value').extract()
            yield item
As you can see, I'm trying to get the code and text for each option, along with the site's "Locale Name" (country).
As a side note, I've tried CSS selectors to no avail. Does anyone know what I'm missing?
Thanks a lot in advance,
A

The problem is the webpage's use of JavaScript. When you open the URL in Chrome, the JavaScript code is executed by the browser, and that is what generates the drop-down menu with the keyboard options.
You should check out a headless browser (PhantomJS, etc.) which will do the JavaScript execution for you. With Splash, the Scrapy ecosystem offers its own headless browser, which can be easily integrated via the scrapyjs.SplashMiddleware downloader middleware:
https://github.com/scrapy-plugins/scrapy-splash
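A minimal sketch of that setup, assuming the scrapy-splash package from the link above is installed and a Splash instance is running on localhost:8050 (the spider name and yielded fields below are only illustrative):

# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider: route the request through Splash so the JavaScript runs before parsing
import scrapy
from scrapy_splash import SplashRequest

class ApplekbSplashSpider(scrapy.Spider):
    name = 'applekb_splash'

    def start_requests(self):
        url = 'http://www.apple.com/ae/shop/buy-mac/imac?product=MK482&step=config#'
        yield SplashRequest(url, self.parse, args={'wait': 2})  # give the page's JS ~2 seconds to render

    def parse(self, response):
        # with the rendered HTML, //select/option now matches
        for sel in response.xpath('//select/option'):
            yield {'kb': sel.xpath('text()').extract_first(),
                   'code': sel.xpath('@value').extract_first()}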

The reason //select/option does not find anything is that there is no select tag in the page when you load it with Scrapy. That's because the JavaScript is not executed, so the dropdown is never filled with values.
Try disabling JavaScript in your Chrome developer tools' settings and you should see the same empty page that Scrapy sees when it scrapes the site.
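For instance (a quick check, assuming Scrapy is installed), run scrapy shell "http://www.apple.com/ae/shop/buy-mac/imac?product=MK482&step=config#" and then, inside the shell:

response.xpath('//select/option')   # comes back empty, since the <select> is never rendered without JavaScript
view(response)                      # opens the raw HTML Scrapy actually received in your browser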

Related

Crawl A Web Page with Scrapy and Python 2.7

Link: http://content.time.com/time/covers/0,16641,19230303,00.html (the new DOM link)
Cover page HTML tag
How do I get that src in JSON and download the images?
Next button tag
I want to scrape these 2 links using Scrapy.
Any help!!
I need to write a method to download the images and click on the next page, running them in a for loop until the final image gets downloaded (the final page).
How to download the rest of it, I'll figure out.
I followed this tutorial: https://www.pyimagesearch.com/2015/10/12/scraping-images-with-python-and-scrapy/ (its DOM is already outdated).
I've already set up all the files and pipelines for the project.
For the record, I tried several different methods: XPath, CSS selectors, and so on.
https://github.com/Dhawal1306/Scrapy
Everything is done; the solution is on GitHub. We have somewhere around 4,700 images, along with the JSON as well.
As for a tutorial: any question, you just have to ask!
I know this is not Scrapy, but I found it easier using BS4. So you have to "pip install beautifulsoup4". Here is a sample:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://mouradcloud.westeurope.cloudapp.azure.com/blog/blog/category/food/")
data = r.text
soup = BeautifulSoup(data, "lxml")

for link in soup.find_all('img'):
    image_url = link.get("src")
    print(image_url)
It worked like a charm
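If you also need to save the images to disk (as the question above asks), a minimal extension of that snippet could look like the following; the "images" folder name is just an example, and urljoin handles relative src values:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = "https://mouradcloud.westeurope.cloudapp.azure.com/blog/blog/category/food/"
soup = BeautifulSoup(requests.get(base_url).text, "lxml")

os.makedirs("images", exist_ok=True)
for link in soup.find_all("img"):
    src = link.get("src")
    if not src:
        continue
    image_url = urljoin(base_url, src)                    # resolve relative URLs against the page
    filename = os.path.join("images", image_url.rstrip("/").split("/")[-1])
    img = requests.get(image_url)
    if img.status_code == 200:
        with open(filename, "wb") as f:                   # write the raw image bytes
            f.write(img.content)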

Is it possible to force fail a recaptcha v2 for testing purposes? (I.e. pretend to be a robot)

I'm implementing an invisible reCAPTCHA as per the instructions in the documentation: reCAPTCHA V2 documentation
I've managed to implement it without any problems. But, what I'd like to know is whether I can simulate being a robot for testing purposes?
Is there a way to force the reCAPTCHA to respond as if it thought I was a robot?
Thanks in advance for any assistance.
In the Dev Tools, open Settings, then Devices, add a custom device with any name and user agent equal to Googlebot/2.1.
Finally, in Device Mode, at the left of the top bar, choose the device (the default is Responsive).
You can test the captcha at https://www.google.com/recaptcha/api2/demo?invisible=true
(This is a demo of the invisible reCAPTCHA. You can remove the invisible URL parameter to test with the captcha button.)
You can use a Chrome plugin like Modify Headers and add a user agent like Googlebot/2.1 (+http://www.google.com/bot.html).
For Firefox, if you don't want to install any add-ons, you can easily change the user agent manually:
Enter about:config into the URL box and hit return;
Search for "useragent" (one word), just to check what is already there;
Create a new string (right-click somewhere in the window) titled (i.e. a new preference) "general.useragent.override", with the string value "Googlebot/2.1" (or any other you want to test with).
I tried this with Recaptcha v3, and it indeed returns a score of 0.1
And don't forget to remove this line from about:config when done testing!
I found this method here (it is an Apple OS article, but the Firefox method also works for Windows): http://osxdaily.com/2013/01/16/change-user-agent-chrome-safari-firefox/
I find that if you click on the reCAPTCHA logo rather than the text box, it tends to fail.
This is because bots detect clickable hitboxes. The checkbox is an image, as is the "I'm not a robot" text, and bots can't properly process images as text, but they CAN process clickable hitboxes, which the reCAPTCHA tells them to click; it just doesn't tell them where.
Click as far away from the checkbox as possible while keeping your mouse cursor in the reCAPTCHA. You will then most likely fail it (it will just bring up the challenge where you have to identify the pictures).
The pictures are there because, like I said, bots can't process images and recognize things like cars.
Yes, it is possible to force-fail a reCAPTCHA v2 for testing purposes.
There are two ways to do that.
First way:
You need the Firefox browser for this. Just make a simple form request, wait for the response, and after getting the response click on the refresh button. Firefox will prompt a box saying "To display this page, Firefox must send information that will repeat any action (such as a search or order confirmation) that was performed earlier." Then click "Resend".
By doing this the browser will send the previous "g-recaptcha-response" key, and this will fail your reCAPTCHA.
Second way:
You can make any simple POST request with any application; for example, on Linux you can use curl to make the POST request.
Just make sure that you specify all your form fields and also the headers for the request, and, most importantly, POST one field named "g-recaptcha-response" and give any random value to this field.
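A minimal sketch of that second approach using Python's requests (the answer above suggests curl, but any HTTP client works; the form URL and the other field names here are hypothetical, so substitute whatever your form actually posts):

import requests

# Hypothetical form endpoint and fields - replace with your form's real action URL and inputs
resp = requests.post(
    "https://example.com/contact-form",
    headers={"User-Agent": "Mozilla/5.0"},
    data={
        "name": "test",
        "email": "test@example.com",
        "g-recaptcha-response": "any-random-invalid-token",  # bogus token, so server-side verification fails
    },
)
print(resp.status_code)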
None of the proposed answers worked for me. I just wrote a simple Node.js script which opens a browser window with a page; reCAPTCHA detects the automated browser and shows the challenge. The script is below:
const puppeteer = require('puppeteer');

let testReCaptcha = async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('http://yourpage.com');
};
testReCaptcha();
Don't forget to install puppeteer by running npm i puppeteer and change yourpage.com to your page address

Automatic download file from web page

I am looking for a way to automatically download a file from a website.
Currently the process is really manual and heavy.
I go to a webpage, enter my password and log in.
It opens a pop-up, where I have to click a download button to save a .zip file.
Do you have any advice on how I could automate this task?
I am on Windows 7, and I can mainly use MS-DOS batch or Python, but I am open to other ideas.
You can use Selenium WebDriver to automate the downloading. You can use the snippet below to set the browser download preferences in Java.
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("browser.download.folderList", 2);
profile.setPreference("browser.download.manager.showWhenStarting", false);
profile.setPreference("browser.download.dir", "C:\\downloads");
profile.setPreference("browser.helperApps.neverAsk.openFile","text/csv,application/x-msexcel,application/excel,application/x-excel,application/vnd.ms-excel,text/html,text/plain,application/msword,application/xml");
To handle the popup, use the Robot class when the popup comes up.
Robot robot = new Robot();
robot.keyPress(KeyEvent.VK_DOWN);     // arrow down to the save option in the download dialog
robot.keyRelease(KeyEvent.VK_DOWN);
robot.keyPress(KeyEvent.VK_ENTER);    // confirm the selection
robot.keyRelease(KeyEvent.VK_ENTER);
You'll want to take a look at requests (to fetch the HTML and the file) and BeautifulSoup (to parse the HTML and find the links).
requests has built-in auth: http://docs.python-requests.org/en/latest/
BeautifulSoup is quite easy to use: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Pseudocode: use requests to download the site's HTML with auth. Go through the links by parsing the HTML. If a link meets the criteria, save it in a list, else continue. When all the links have been scraped, go through them and download each file using requests (req = requests.get('url_to_file_here', auth=('username', 'password')); if req.status_code == 200, file = req.content).
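A rough sketch of that pseudocode (the URL, credentials, and the ".zip" criterion are placeholders to adjust for your site):

import requests
from bs4 import BeautifulSoup

PAGE_URL = "https://example.com/downloads"   # hypothetical page that lists the files
AUTH = ("username", "password")              # requests' built-in HTTP basic auth

# 1. Download the site's HTML (authenticated) and parse it
resp = requests.get(PAGE_URL, auth=AUTH)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")

# 2. Go through the links and keep the ones that meet the criteria (here: .zip files)
links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".zip")]

# 3. Download each file with requests
for url in links:
    req = requests.get(url, auth=AUTH)
    if req.status_code == 200:
        with open(url.rsplit("/", 1)[-1], "wb") as f:
            f.write(req.content)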
If you can post the link of the site you want to download from, maybe we can do more.

Can I set Watir to not open a window?

Until recently, I was using Mechanize to get a web page and then do some parsing with Nokogiri. But because some content was loaded with Ajax, I have since started using Watir instead. My code looks like this:
def get_page(url)
  browser = Watir::Browser.start url
  sleep 1
  page = Nokogiri::HTML.parse(browser.html)
  browser.close
  return page
end
It works fine, but since I am getting a lot of pages, browser.start will open tons of windows. I found close, as you can see, but is there a way to just not show the browser window at all?

Installing an editor plugin in web2py app

I am trying to install an editor plugin "ck-editor4" in my web2py app, following the steps at:
http://www.web2pyslices.com/slice/show/1952/ck-editor4-plugin
and
https://bitbucket.org/PhreeStyle/web2py_ckeditor/wiki/Home
I wrote the given pieces of code in my application's "models/db1.py" and "views/default/index.html" as directed in the above links, but things are not working correctly. I am a newbie in web2py. Please help me install an editor (preferably one which supports programming languages) with detailed steps. Thanks!
This worked for me:
In db.py:

from plugin_ckeditor import CKEditor
ckeditor = CKEditor(db)

db.define_table('wpage',
    Field('title'),
    Field('body', 'text', widget=ckeditor.widget),
    auth.signature,  # created_on, created_by, modified_on, modified_by, is_active
    format='%(title)s')
In default.py:
@auth.requires_login()
def create():
    """creates a new empty wiki page"""
    ckeditor.define_tables()
    form = SQLFORM(db.wpage).process(next=URL('index'))
    return dict(form=form)
I used ckeditor.define_tables() in edit() and show() too. Now in show.html, display the formatting using:
{{=XML(page.body,sanitize=False)}}
This is all explained in the links in your post and
https://github.com/timrichardson/web2py_ckeditor4
By the way, I found a simpler way to install an editor for a web2py app. It is not the ck-editor4 plugin but NicEdit. Just two lines of JavaScript code will do the job. The following link came to my rescue:
http://nicedit.com/
Just follow the simple steps in the "Quick Start Guide" on the right-hand side. Create a textarea and your work is done.
