My WebScraper uses urllib to get data from sites like YouTube, but I often run into a problem when I make too many requests and the site blocks my connection.
So my question is: is there a way, in Python, to bypass this?
For example, by changing the IP address (via the native socket module, or os.system() with a command like netsh), by using a simple API that doesn't require authentication (like OAuth or a key), or simply by using a web-based proxy to divert my traffic?
search_url = "https://www.youtube.com/results?search_query="  # search URL
bypass_url = "https://someProxy.com/url=" + search_url        # proxied search URL
for video_ID in raw_video_list:
    raw_html = self.ReadHTML(search_url + video_ID)  # returns raw HTML
    # Then the program does its magic with that HTML
That is just a basic idea of the program, but it'll iterate a block like that over a hundred times.
Using Python 2.7, Windows 8, native modules
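To give an idea of the native-module route I'm imagining, here is an untested sketch using urllib2's ProxyHandler; the proxy address is a placeholder, not a real service:
import urllib2

# Untested sketch: route all urllib2 traffic through an HTTP proxy (Python 2.7).
# The proxy address below is a placeholder.
proxy_support = urllib2.ProxyHandler({'http': 'http://203.0.113.10:8080'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
# Subsequent urlopen calls are sent via the proxy
raw_html = urllib2.urlopen(search_url + video_ID).read()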
Related
We crawl with Scrapy + Splash and want to use multiple proxies, but Splash only supports a single proxy profile: https://splash.readthedocs.io/en/stable/api.html#proxy-profiles.
[proxy]
; required
host=proxy.crawlera.com
port=8010
; optional, default is no auth
username=username
password=password
; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP
How can we use multiple proxies when crawling with Scrapy + Splash?
There are several options:
use multiple proxy profiles (as Rafael Almeida suggested in a comment);
pass a different proxy URL with each request (see http://splash.readthedocs.io/en/stable/api.html#arg-proxy; a minimal sketch follows this list);
write a Splash Lua script and use request:set_proxy in the splash:on_request callback; there is an example in the docs. This way you can set a different proxy for different requests initiated by a page, not only a single proxy per rendered page. I'm not aware of a way to do that in other browser-automation tools like PhantomJS or Selenium.
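For the second option, a minimal sketch with scrapy-splash might look like the following; the spider name, URLs, and proxy addresses are placeholders:
import itertools
import scrapy
from scrapy_splash import SplashRequest

class ProxiedSpider(scrapy.Spider):
    name = 'proxied'
    start_urls = ['http://example.com/a', 'http://example.com/b']
    # Placeholder proxies, rotated across requests
    proxies = itertools.cycle([
        'http://user:pass@proxy1.example.com:8010',
        'http://user:pass@proxy2.example.com:8010',
    ])

    def start_requests(self):
        for url in self.start_urls:
            # Splash's 'proxy' render argument applies to this request only
            yield SplashRequest(url, self.parse, args={'proxy': next(self.proxies)})

    def parse(self, response):
        pass  # extract data here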
I want to disable sessions for headless API endpoints, but I have to keep them turned on because this service also handles user logins.
However, makeSessionBackend doesn't have access to Handler machinery or even the current URI, the way isAuthorizedSource does.
It appears that I would have to copy the client session backend code and sprinkle it with wrappers until I can get at least the textual path out of the WAI Request.
Isn't there a better way to tell any backend to ignore some routes, like StaticR?
All of your points can be addressed by overriding the makeSessionBackend method in the Yesod typeclass. Something like:
instance Yesod App where
    makeSessionBackend _ = fmap Just $ defaultClientSessionBackend expireTime filepath
      where expireTime = 24 * 60
A Django web app needs to make AJAX calls to an external URL. In development I serve directly from Django, so I have a cross-domain problem. What is the Django way to write a proxy for the AJAX call?
Here's a dead simple proxy implementation for Django.
from django.http import HttpResponse
import mimetypes
import urllib2

def proxy_to(request, path, target_url):
    url = '%s%s' % (target_url, path)
    if request.META.get('QUERY_STRING'):
        url += '?' + request.META['QUERY_STRING']
    try:
        proxied_request = urllib2.urlopen(url)
        status_code = proxied_request.code
        mimetype = proxied_request.headers.typeheader or mimetypes.guess_type(url)[0]
        content = proxied_request.read()
    except urllib2.HTTPError as e:
        return HttpResponse(e.msg, status=e.code, mimetype='text/plain')
    else:
        return HttpResponse(content, status=status_code, mimetype=mimetype)
This proxies requests from PROXY_PATH+path to TARGET_URL+path.
The proxy is enabled and configured by adding a URL pattern like this to urls.py:
url(r'^PROXY_PATH/(?P<path>.*)$', proxy_to, {'target_url': 'TARGET_URL'}),
For example:
url(r'^images/(?P<path>.*)$', proxy_to, {'target_url': 'http://imageserver.com/'}),
a request to http://localhost:8000/images/logo.png will fetch and return the file at http://imageserver.com/logo.png.
Query strings are forwarded, while HTTP headers such as cookies and POST data are not; it's quite easy to add that if you need it (a sketch follows the note below).
Note: This is mainly intended for development use. The proper way to handle proxying in production is with the HTTP server (e.g. Apache or Nginx).
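If you do need POST data and cookies, an untested sketch of the extra lines could look like this; they would replace the plain urlopen(url) call above, and "upstream" is just an illustrative name:
# Untested sketch: forward the POST body and the Cookie header.
# request.body requires Django >= 1.4; older versions use request.raw_post_data.
data = request.body if request.method == 'POST' else None
upstream = urllib2.Request(url, data)
if 'HTTP_COOKIE' in request.META:
    upstream.add_header('Cookie', request.META['HTTP_COOKIE'])
proxied_request = urllib2.urlopen(upstream)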
I ran across this question while trying to answer it myself, and found this Django app:
http://httpproxy.yvandermeer.net/
...which is a little heavyweight for what I needed (recording and playback, requires a syncdb to add in model stuff). But you can see the code it uses in its generic proxying view, which is based on httplib2:
http://bitbucket.org/yvandermeer/django-http-proxy/src/1776d5732113/httpproxy/views.py
Am I right that you are asking how to write a view in Django that accepts an incoming AJAX request, issues a request to the remote server, and then returns the received response to the browser?
If so, it's not really a Django-specific question: remote calls can be made with Python's urllib2 or httplib, and then you just have to put:
return HttpResponse(received_response)
-- in your Django proxy view. I assume no response processing here, because if it's just a proxy for an AJAX call, the JavaScript expects unprocessed data.
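For illustration, such a pass-through view could be as small as this sketch; the remote URL and view name are placeholders:
import urllib2
from django.http import HttpResponse

def ajax_proxy(request):
    # Placeholder remote endpoint; add query-string forwarding as needed
    received_response = urllib2.urlopen('http://remote.example.com/api').read()
    return HttpResponse(received_response)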
I am working on a website hosted on Microsoft's Office Live service. It has a contact form enabling visitors to get in touch with the owner. I want to write a Ruby script that sits on a separate server and that the form will POST to. It will parse the form data and email the details to a preset address. The script should then redirect the browser to a confirmation page.
I have an Ubuntu Hardy machine running nginx and Postfix. Ruby is installed, and we shall see about using Thin and its Rack functionality to handle the script. Now it has come to writing the script, and I've drawn a blank.
It's been a long time, and if I remember rightly the process is something like:
read HTTP header
parse parameters
send email
send redirect header
Broadly speaking, the question has been answered. Figuring out how to use the answer was more complicated than expected, and I thought it worth sharing.
First Steps:
I learnt rather abruptly that nginx doesn't directly support CGI scripts. You have to use some other process to run the script and get nginx to proxy requests over to it. If I were doing this in PHP (which in hindsight I think would have been a more natural choice), I could use something like php-fcgi and expect life to be pretty straightforward.
Ruby and FastCGI felt pretty daunting. But if we are abandoning the ideal of loading these things at runtime, then Rack is probably the most straightforward solution, and Thin includes all we need. Learning how to make basic little apps with them has been profoundly beneficial to a relative Rails newcomer like me. The foundations of a Rails app can seem hidden for a long time, and Rack has helped me lift the curtain that little bit further.
Nonetheless, following Yehuda's advice and looking up Sinatra has been another surprise. I now have a basic Sinatra app running in a Thin instance. It communicates with nginx over a Unix socket in what I gather is the standard way. Sinatra enables a really elegant way to handle different requests and routes into the app. All you need is a get '/' block to start handling requests to the virtual host. To add more (in a clean fashion) we just require a routes/script.rb file from the main file.
# cgi-bin.rb
# main file loaded as a sinatra app
require 'sinatra'
# load cgi routes
require 'routes/default'
require 'routes/contact'
# 404 behaviour
not_found do
  "Sorry, this CGI host does not recognize that request."
end
These route files will call on functionality stored in a separate library of classes:
# routes/contact.rb
# contact controller
require 'lib/contact/contactTarget'
require 'lib/contact/contactPost'
post '/contact/:target/?' do |target|
  # the target for the message is taken from the URL
  msg = ContactPost.new(request, target)
  redirect msg.action, 302
end
The sheer horror of figuring out such a simple thing will stay with me for a while. I was expecting to calmly let nginx know that .rb files were to be executed and to just get on with it. Now that this little sinatra app is up and running, I'll be able to dive straight in if I want to add extra functionality in the future.
Implementation:
The ContactPost class handles the messaging aspect. All it needs to know are the parameters in the request and the target for the email. ContactPost::action kicks everything off and returns an address for the controller to redirect to.
There is a separate ContactTarget class that does some authentication to make sure the specified target accepts messages from the URL given in request.referrer. This is handled in ContactTarget::accept?, as we can guess from the ContactPost::action method:
# lib/contact/contactPost.rb
class ContactPost
  # ...
  def action
    return failed unless @target.accept? @request.referer
    if send?
      successful
    else
      failed
    end
  end
  # ...
end
ContactPost::successful and ContactPost::failed each return a redirect address by combining paths supplied by the HTML form with the request.referer URI. All the behaviour is thus specified in the HTML form. Future websites that use this script just need to be listed in the user's own ~/cgi/contact.conf and they'll be away. This is because ContactTarget looks in /home/:target/cgi/contact.conf for the details. Maybe one day this will be inappropriate, but for now it's just fine for my purposes.
The send method is simple enough: it creates an instance of a simple Email class and ships it out. The Email class is pretty much based on the standard usage example given in the Ruby net/smtp documentation:
# lib/email/email.rb
require 'net/smtp'
require 'date'  # for DateTime.now

class Email
  def initialize(from_alias, to, reply, subject, body)
    @from_alias = from_alias
    @from = "cgi_user@host.domain.com"
    @to = to
    @reply = reply
    @subject = subject
    @body = body
  end

  def send
    Net::SMTP.start('localhost', 25) do |smtp|
      smtp.send_message to_s, @from, @to
    end
  end

  def to_s
    <<END_OF_MESSAGE
From: #{@from_alias}
To: #{@to}
Reply-To: #{@reply}
Subject: #{@subject}
Date: #{DateTime.now.to_s}

#{@body}
END_OF_MESSAGE
  end
end
All I need to do is rack up the application, let nginx know which socket to talk to and we're away.
Thank you everyone for your helpful pointers in the right direction! Long live sinatra!
It's all in the Net module; here's an example:
@net = Net::HTTP.new 'www.foo.com', 80
@params = {:name => 'doris', :email => 'doris@foo.com'}

# Create HTTP request
req = Net::HTTP::Post.new('/script.cgi')
req.set_form_data @params

# Send request
response = @net.start do |http|
  http.read_timeout = 5600
  http.request req
end
Probably the best way to do this would be to use an existing Ruby library like Sinatra:
require "rubygems"
require "sinatra"
get "/myurl" do
# params hash available here
# send email
end
You'll probably want to use MailFactory to send the actual email, but you definitely don't need to be mucking about with headers or parsing parameters.
Ruby's CGI class can be used for writing CGI scripts. Please check: http://www.ruby-doc.org/stdlib/libdoc/cgi/rdoc/index.html
By the way, there is no need to read the HTTP header yourself. Parsing parameters is easy with the CGI class. Then send the e-mail and redirect.
I have to access some pages at work and then log into them to report any problems. I was thinking of writing a program to do this.
First, I have to be able to access the pages, then I have to locate the login form and send the info. Currently, I plan on printing true/false for each test (accessibility and login) and then filling the forms myself. I'm hoping to be able to write something to automate this later.
I was thinking of using Ruby; although I haven't coded in it yet, it seems like it'd make the whole thing easier. I've worked the most with Java, though I have some experience with C++ and a bit of experience with C.
Any advice?
You can use Selenium IDE. It is a record-and-playback tool for simple web tests, which you can then save as a test for Selenium RC in any language you want. I hope it helps.
The Python urllib2 module lets you interact with an HTTP server easily. You can use urllib2 to read a page and verify its content, and you can do a POST with urlencoded form data and verify the response.
Further, Python has a simple unittest library that will help you structure your tests.
import unittest
import urllib
import urllib2

class TestForm(unittest.TestCase):
    def testFillInForm(self):
        data = urllib.urlencode({"field1": "value", "field2": "value"})
        response = urllib2.urlopen("http://localhost/path/to/form", data)
        # check the response, e.g. status code and body content

if __name__ == "__main__":
    unittest.main()
Ruby, PHP, and Python all have easy-to-use HTTP libraries that make this kind of operation straightforward. Any of these languages would work fine.
If you want to do this in Ruby, the Mechanize gem would be perfect for it:
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://localhost/path/to/form')
login_form = page.forms.first # assuming the first form is the one we want
login_form.username = 'myusername'
login_form.password = 'mypassword'
page = agent.submit(login_form)
puts page.body # just to see the results
I have found cURL to be really useful and easy to use under PHP, and easy to learn.
It handles cookies, HTTPS, etc.
All good.