Crawl A Web Page with Scrapy and Python 2.7 - image

Link: http://content.time.com/time/covers/0,16641,19230303,00.html
[screenshot: cover page HTML tag]
How do I get that src (as JSON) and download the images?
[screenshot: next-button HTML tag]
I want to scrape these two elements using Scrapy.
Any help?!
I need to write a method that downloads the image, clicks the next-page link, and repeats in a loop until the final image is downloaded (the last page).
I'll figure out how to download the rest myself.
I followed this tutorial: https://www.pyimagesearch.com/2015/10/12/scraping-images-with-python-and-scrapy/
(the DOM it targets is already outdated).
I've already set up all the files and pipelines for the project.
For the record, I tried several different methods: XPath and CSS selectors on the response.

Everything is done; the solution is on GitHub: https://github.com/Dhawal1306/Scrapy
We ended up with somewhere around 4,700 images, along with the JSON as well.
As for a tutorial, you just have to ask!
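For a rough idea of the approach, here is a minimal sketch of such a spider (the selectors, the "Next" link text, and the item fields are assumptions, not taken from the repo; the ImagesPipeline from the question's setup is expected to pick up "image_urls"):

import scrapy

class TimeCoversSpider(scrapy.Spider):
    # Sketch: grab the cover image on each page, then follow the next-page link.
    name = "timecovers"
    start_urls = ["http://content.time.com/time/covers/0,16641,19230303,00.html"]

    def parse(self, response):
        # Assumed selector for the cover image; adjust to the real DOM.
        cover_src = response.xpath('//img[contains(@src, "cover")]/@src').extract_first()
        if cover_src:
            yield {"image_urls": [response.urljoin(cover_src)]}

        # Assumed selector for the next-page link; adjust to the real DOM.
        next_href = response.xpath('//a[contains(., "Next")]/@href').extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)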

I know this is not Scrapy, but I found it easier using BeautifulSoup (BS4), so you have to "pip install beautifulsoup4". Here is a sample:
import requests
from bs4 import BeautifulSoup
import os

r = requests.get("https://mouradcloud.westeurope.cloudapp.azure.com/blog/blog/category/food/")
data = r.text
soup = BeautifulSoup(data, "lxml")
for link in soup.find_all('img'):
    image_url = link.get("src")
    print(image_url)
It worked like a charm
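If you also want to save the images rather than just print their URLs, here is a hedged follow-up sketch (the "downloads" folder name is arbitrary, and relative src values are resolved against the page URL):

import os
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2.7; on Python 3 use urllib.parse

url = "https://mouradcloud.westeurope.cloudapp.azure.com/blog/blog/category/food/"
soup = BeautifulSoup(requests.get(url).text, "lxml")

if not os.path.isdir("downloads"):
    os.makedirs("downloads")

for link in soup.find_all("img"):
    src = link.get("src")
    if not src:
        continue
    image_url = urljoin(url, src)  # handle relative src values
    filename = os.path.join("downloads", os.path.basename(image_url.split("?")[0]))
    resp = requests.get(image_url, stream=True)
    with open(filename, "wb") as f:
        for chunk in resp.iter_content(8192):
            f.write(chunk)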

Related

Google API + proxy + httplib2

I'm currently running a script to pull data from Google Analytics with the googleapiclient Python package (which is built on an httplib2 client object).
My script works perfectly without any proxy.
But I have to put it behind my corporate proxy, so I need to adapt my httplib2.Http() object to embed the proxy information.
Following the httplib2 docs, I tried:
pi = httplib2.proxy_info_from_url('http://user:pwd@someproxy:80')
httplib2.Http(proxy_info=pi).request("http://www.google.com")
But it did not work.
I always get a timeout error, with or without the proxy info (so the proxy_info parameter is not taken into account).
I also downloaded socks from the PySocks package (v1.5.6) and tried to "wrapmodule" httplib2 as described here:
https://github.com/jcgregorio/httplib2/issues/205
socks.setdefaultproxy(socks.PROXY_TYPE_HTTP, "proxyna", port=80, username='p.tisserand', password='Telematics12')
socks.wrapmodule(httplib2)
h = httplib2.Http()
h.request("http://google.com")
But I get an IndexError (tuple index out of range).
In the meantime, when I use the requests package, this simple code works perfectly:
os.environ["HTTP_PROXY"] = "http://user:pwd@someproxy:80"
req = requests.get("http://www.google.com")
The problem is that I need to fit the googleapiclient requirements and provide an httplib2.Http() client object.
Rather than using Python 2, I think you'd better try using httplib2shim.
You can have a look at this tutorial on my blog:
https://dinatam.com/fr/python-3-google-api-proxy/
In simple words, just replace this kind of code:
from httplib2 import Http
http_auth = credentials.authorize(Http())
with this one:
import httplib2shim
http_auth = credentials.authorize(httplib2shim.Http())
I decided to recode my web app in Python 2, still using the httplib2 package.
The proxy info is now taken into account, and it now works.
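For anyone landing here, a minimal sketch of passing explicit proxy info to httplib2 under Python 2 (the host, port and credentials are placeholders, and the googleapiclient lines are only illustrative):

import httplib2
from httplib2 import socks

# Placeholder proxy details; replace with your corporate proxy settings.
proxy_info = httplib2.ProxyInfo(
    proxy_type=socks.PROXY_TYPE_HTTP,
    proxy_host="someproxy",
    proxy_port=80,
    proxy_user="user",
    proxy_pass="pwd",
)

http = httplib2.Http(proxy_info=proxy_info)

# googleapiclient accepts this Http object, e.g.:
# http_auth = credentials.authorize(http)
# service = build("analytics", "v3", http=http_auth)
resp, content = http.request("http://www.google.com")
print(resp.status)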

Scrapy: '//select/option' xpath not yielding any results

I've been trying Scrapy and absolutely love it. However, one of the things I'm testing it on does not seem to work.
I'm trying to scrape a page (apple.com, for example) and save a list of the keyboard options available, using the simple XPath
//select/option
When using the Chrome console, the website below comes back with an array of selections that I can easily iterate through; however, if I use response.xpath('//select/option') via the scraper, or via the Scrapy shell, I get nothing back from it.
My code for the scraper looks a bit like the below (edited for simplicity)
import scrapy
from scrapy.linkextractors import LinkExtractor
from lxml import html
from apple.items import AppleItem

class ApplekbSpider(scrapy.Spider):
    name = 'applekb'
    allowed_domains = ['apple.com']
    start_urls = ('http://www.apple.com/ae/shop/buy-mac/imac?product=MK482&step=config#', )

    def parse(self, response):
        for sel in response.xpath('//select/option'):
            item = AppleItem()
            item['country'] = sel.xpath('//span[@class="as-globalfooter-locale-name"]/text()').extract()
            item['kb'] = sel.xpath('text()').extract()
            item['code'] = sel.xpath('@value').extract()
            yield item
As you can see, I'm trying to get the code and text for each option, along with the site "Locale Name" (country).
As a side note, I've tried CSS selectors to no avail. Does anyone know what I'm missing?
Thanks a lot in advance,
A
The problem is the web page's use of JavaScript. When you open the URL in Chrome, the JavaScript code is executed by the browser, which generates the drop-down menu with the keyboard options.
You should check out a headless browser (PhantomJS etc.) which will do the JavaScript execution. With Splash, the Scrapy ecosystem offers its own headless browser, which can easily be integrated via the SplashMiddleware downloader middleware from the scrapy-splash (formerly scrapyjs) project.
https://github.com/scrapy-plugins/scrapy-splash
The reason //select/option does not find anything is that there is no select tag in the page when you load it with Scrapy. That's because the JavaScript is not executed and the dropdown is not filled with values.
Try disabling JavaScript in your Chrome developer tools settings and you should see the same empty page that Scrapy sees when it scrapes the site.
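As a rough, hedged sketch of that integration (assuming a Splash instance running locally on port 8050; the setting names are from the scrapy-splash README):

# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# In the spider, request the page through Splash so the JavaScript actually runs:
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 2})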

Automatic download file from web page

I am looking for a way to automatically download a file from a website.
Currently the process is very manual and heavy.
I go to a webpage and enter my login and password.
It opens a popup, where I have to click a download button to save a .zip file.
Do you have any advice on how I could automate this task?
I am on Windows 7, and I can mainly use MS-DOS batch or Python, but I am open to other ideas.
You can use Selenium WebDriver to automate the download. You can use the snippet below to set the browser download preferences in Java.
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("browser.download.folderList", 2);
profile.setPreference("browser.download.manager.showWhenStarting", false);
profile.setPreference("browser.download.dir", "C:\\downloads");
profile.setPreference("browser.helperApps.neverAsk.openFile","text/csv,application/x-msexcel,application/excel,application/x-excel,application/vnd.ms-excel,text/html,text/plain,application/msword,application/xml");
To handle the popup, use the Robot class when the popup appears:
Robot robot = new Robot();
robot.keyPress(KeyEvent.VK_DOWN);
robot.keyRelease(KeyEvent.VK_DOWN);
robot.keyPress(KeyEvent.VK_ENTER);
robot.keyRelease(KeyEvent.VK_ENTER);
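Since the question mentions Python, here is a hedged sketch of the same preference setup with Python's Selenium bindings (the download directory and MIME types are just examples):

from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)  # 2 = use a custom download directory
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", r"C:\downloads")
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/zip,application/octet-stream")

driver = webdriver.Firefox(firefox_profile=profile)
driver.get("https://example.com/login")  # placeholder URL; log in and click the download button here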
You'll want to take a look at requests (to fetch the HTML and the file) and BeautifulSoup (to parse the HTML and find the links).
requests has built in auth: http://docs.python-requests.org/en/latest/
Beautifulsoup is quite easy to use: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Pseudocode: use requests to download the site's HTML with auth. Go through the links by parsing the HTML. If a link meets the criteria, save it in a list; otherwise continue. When all the links have been collected, go through them and download each file using requests (req = requests.get('url_to_file_here', auth=('username', 'password')); if req.status_code == 200, write req.content to a file). See the sketch below.
If you can post the link of the site you want to download from, maybe we can do more.
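A hedged sketch of that pseudocode (the URL, credentials and the ".zip" filter are placeholders, and relative links may still need to be joined against the base URL):

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/reports"  # placeholder
AUTH = ("username", "password")           # HTTP basic auth; adjust to the site's login scheme

html = requests.get(BASE_URL, auth=AUTH).text
soup = BeautifulSoup(html, "lxml")

# Collect the links that meet the criteria (here: anything ending in .zip).
zip_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".zip")]

for link in zip_links:
    resp = requests.get(link, auth=AUTH)
    if resp.status_code == 200:
        filename = link.rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(resp.content)  # binary content (not .text) for a zip file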

Installing an editor plugin in web2py app

I am trying to install an editor plugin "ck-editor4" in my web2py app, following the steps at:
http://www.web2pyslices.com/slice/show/1952/ck-editor4-plugin
and
https://bitbucket.org/PhreeStyle/web2py_ckeditor/wiki/Home
I wrote the given piece of code in my application's "models/db1.py" and "views/default/index.html" as directed in the above links, but things are not working correctly. I am a newbie at web2py. Please help me install an editor (preferably one that supports programming languages), in detailed steps. Thanks!
This worked for me:
in db.py:
from plugin_ckeditor import CKEditor
ckeditor = CKEditor(db)
db.define_table('wpage',
    Field('title'),
    Field('body', 'text', widget=ckeditor.widget),
    auth.signature,  # created_on, created_by, modified_on, modified_by, is_active
    format='%(title)s')
In default.py:
@auth.requires_login()
def create():
    """creates a new empty wiki page"""
    ckeditor.define_tables()
    form = SQLFORM(db.wpage).process(next=URL('index'))
    return dict(form=form)
I used ckeditor.define_tables() in edit() and show() too. Now in show.html, display the formatting using:
{{=XML(page.body,sanitize=False)}}
This is all explained in the links in your post and
https://github.com/timrichardson/web2py_ckeditor4
By the way, I found a simple way to install an editor for a web2py app. It is not the ck-editor4 plugin but NicEdit, and just two lines of JavaScript will do the job. The following link came to my rescue:
http://nicedit.com/
Just follow the simple steps in the "Quick Start Guide" on the right-hand side: create a textarea and your work is done.

HtmlUnit Ajax file download

I'm trying to download the XLS file from this page: http://www.nordpoolspot.com/Market-data1/Elspot/Area-Prices/ALL1/Hourly/ (click on "Export to XLS" link).
However doing:
page.getAnchorByText("Export to XLS").click().getWebResponse().getContentAsStream();
returns the html of the web page, instead of the expected file.
Do you have any suggestion?
I already tried the 3 points here http://htmlunit.sourceforge.net/faq.html#AJAXDoesNotWork without success.
The following:
webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
fixed my issue.
