How to use multiple proxies when crawling with Scrapy + Splash?

We crawl with Scrapy + Splash and we want to use multiple proxies, but Splash proxy profiles only support a single proxy (https://splash.readthedocs.io/en/stable/api.html#proxy-profiles):
[proxy]
; required
host=proxy.crawlera.com
port=8010
; optional, default is no auth
username=username
password=password
; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP
How can we use multiple proxies when crawling with Scrapy + Splash?

There are several options:
use multiple proxy profiles (as Rafael Almeida suggested in a comment);
pass a different proxy URL with each request (see http://splash.readthedocs.io/en/stable/api.html#arg-proxy and the sketch after this list);
write a Splash Lua script and use request:set_proxy in the splash:on_request callback - there is an example in the docs. This way you can set a different proxy for different requests initiated by a page, not only a single proxy per rendered page. I'm not aware of a way to do that in other browser automation tools like PhantomJS or Selenium.
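A minimal sketch of the second option with scrapy-splash, assuming it is already installed and configured; the spider name, start URLs, and proxy addresses below are placeholders rather than part of the original answer:
# Sketch of option 2: pass a different 'proxy' value in the Splash arguments
# of each request. Assumes scrapy-splash is installed and configured; the
# spider name, start URLs, and proxy URLs are placeholders.
import scrapy
from scrapy_splash import SplashRequest

PROXIES = [
    'http://user:pass@proxy1.example.com:8010',
    'http://user:pass@proxy2.example.com:8010',
]

class MultiProxySpider(scrapy.Spider):
    name = 'multi_proxy'
    start_urls = ['http://example.com/page1', 'http://example.com/page2']

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            # Each request is rendered by Splash through its own proxy.
            yield SplashRequest(url, self.parse,
                                args={'proxy': PROXIES[i % len(PROXIES)]})

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}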

Related

Using different Proxies per page with Puppeteer

Set Proxy per page in a Puppeteer Browser
I'm using a for...of loop to create a new page for each automated instance, but after both pages load and take a screenshot, whichever instance starts automating first takes over, and only that automation runs.
From what I've seen, setting flags is only possible when creating a new browser,
e.g.
const browser = await puppeteer.launch({args:['--proxy-server=ip:port']});
I can't seem to find any docs about setting it per page.
I made a module that does this. It's called puppeteer-page-proxy.
It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request.
First install it:
npm i puppeteer-page-proxy
Then require it:
const useProxy = require('puppeteer-page-proxy');
Using it is easy;
Set proxy for an entire page:
await useProxy(page, 'http://127.0.0.1:8000');
If you want a different proxy for each request, then you can simply do this:
await page.setRequestInterception(true);
page.on('request', req => {
    useProxy(req, 'socks5://127.0.0.1:9000');
});
Then if you want to be sure that your page's IP has changed, you can look it up:
const data = await useProxy.lookup(page);
console.log(data.ip);
It supports http, https, socks4 and socks5 proxies, and it also supports authentication if that is needed:
const proxy = 'http://login:pass@127.0.0.1:8000'
Repository:
https://github.com/Cuadrix/puppeteer-page-proxy

Python 2.7 - Send multiple requests to server Without getting blocked (DOS)

My WebScraper uses urllib to get data from sites like YouTube. But I often run into a problem when there are too many requests, resulting in the site blocking my connection.
So my question is, is there a way, in Python, to bypass this?
For example, changing the IP address (via some native socket module, or os.system("some command like netsh")), using a simple API that doesn't require authentication (like OAuth or a key), or simply using a web-based proxy to divert my traffic?
search_url = "https://www.youtube.com/results?search_query="  # Search URL
bypass_url = "https://someProxy.com/url=" + search_url
for video_ID in raw_video_list:
    raw_html = self.ReadHTML(search_url + video_ID)  # Returns raw HTML
    # Then the program does its magic with that HTML
That is just a basic idea of the program, but it'll iterate a block like that over a hundred times.
Using Python 2.7, Windows 8, native modules
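As an illustration of the web-proxy idea described in the question (a sketch only; the proxy address is a placeholder, and rotating through several proxies between requests is left to the reader), Python 2.7's urllib2 can be routed through a proxy with ProxyHandler:
# Sketch: route urllib2 through an HTTP proxy (Python 2.7). The proxy address
# is a placeholder; swapping it between requests spreads traffic across IPs.
import urllib2

proxy_handler = urllib2.ProxyHandler({
    'http':  'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # every urllib2.urlopen() call now uses the proxy

raw_html = urllib2.urlopen("https://www.youtube.com/results?search_query=test").read()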

Swagger page being redirected from https to http

An AWS Elastic Load Balancer listens on HTTPS (443) using SSL and forwards requests to EC2 instances over HTTP (80), where IIS hosts a .NET Web API application that uses Swashbuckle to describe the API methods.
The home page of the API (https://example.com) has a link to the Swagger documentation, which reads https://example.com/swagger/ui/index.html when you hover over the link.
If I click on the link, the browser is redirected to http://example.com/swagger/ui/index.html, which displays a Page Not Found error,
but if I type https://example.com/swagger/ui/index.html directly into the browser then the Swagger page loads. However, when expanding the methods and clicking "Try it out", the Request URL starts with "http" again.
This configuration is only for Stage and Production environments. Lower environments don't use the load balancer and just use http.
Any ideas on how to stop https being redirected to http? And how can I make Swagger display Request URLs using https?
Thank you
EDIT:
I'm using a custom index.html file
It seems this is a known issue for Swashbuckle. Quote:
"By default, the service root url is inferred from the request used to access the docs. However, there may be situations (e.g. proxy and load-balanced environments) where this does not resolve correctly. You can workaround this by providing your own code to determine the root URL."
What I did was provide the root URL and/or scheme to use based on the environment:
GlobalConfiguration.Configuration
    .EnableSwagger(c =>
    {
        ...
        c.RootUrl(req => GetRootUrlFromAppConfig(req));
        ...
        c.Schemes(GetEnvironmentScheme());
        ...
    })
    .EnableSwaggerUi(c =>
    {
        ...
    });
where
public static string[] GetEnvironmentScheme()
{
    ...
}

public static string GetRootUrlFromAppConfig(HttpRequestMessage request)
{
    ...
}
The way I would probably do it is to have one main file and, during the build of your application, generate a different swagger file based on the environment parameters for schemes and hosts.
That way, you only have to manage one swagger file across your environments, plus a few extra environment properties, host and schemes (if you don't already have them).
Since I don't know about Swashbuckle, I can't answer your first question (the redirect) for sure.
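A rough sketch of that build-time idea in Python (not from the original answer; the base file name and the environment table are made-up placeholders):
# Sketch: generate one swagger file per environment from a shared base file,
# overriding only the 'host' and 'schemes' fields. File names and the
# environment table are placeholders.
import json

ENVIRONMENTS = {
    'dev':        {'host': 'localhost',         'schemes': ['http']},
    'stage':      {'host': 'stage.example.com', 'schemes': ['https']},
    'production': {'host': 'example.com',       'schemes': ['https']},
}

with open('swagger.base.json') as f:
    base = json.load(f)

for name, overrides in ENVIRONMENTS.items():
    doc = dict(base)
    doc.update(overrides)
    with open('swagger.%s.json' % name, 'w') as out:
        json.dump(doc, out, indent=2)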

Creating custom header variable using Fiddler proxy server

I understand that the Fiddler proxy server supports custom HTTP header variables. I'm new at this; my goal is to simulate a passed custom variable the way an F5 appliance would pass an HTTP header variable (string) to a web app. The variable string is used to authenticate the user. I do not have access to a load-balancing appliance and have to find a way to simulate it or add it manually.
I'd appreciate any input on how to accomplish this...
You can add your own header variables in the OnBeforeRequest function (or OnBeforeResponse) within CustomRules.js, such as the following:
oSession.oRequest["NewHeaderName"] = "New header value";
More info: http://fiddler2.com/documentation/KnowledgeBase/FiddlerScript/ModifyRequestOrResponse

How can I open a list of URLs on Windows

I'm looking for a way to open a list of URLs in all of my browsers (Firefox, Chrome, and IE) on Windows using a scriptable shell such as PowerShell or Cygwin.
Ideally I should be able to type a list of URLs as arguments to the command, e.g. openUrl http://example.net http://example2.net http://example3.com...
I would also need this script to pass authentication info in the HTTP header (encoded username and password).
With Chrome it's not hard.
$chrome = (gi ~\AppData\Local\Google\Chrome\Application\chrome.exe ).FullName
$urls = "stackoverflow.com","slate.com"
$urls | % { & $chrome $_ }
First, how to open URLs in PowerShell. Opening a URL in PowerShell is very simple; just use start:
start http://your.url.com
I think you can simply use foreach to handle the list of URLs.
Second, passing authentication via the URL. There is a standard way for HTTP-based authentication (not HTML form based). You could construct the URL like:
http://username:password@your.url.com
Again, it only works for HTTP-based authentication.
Look at HKCR\http\shell\open\command to see how each browser handles URLs. Then just use the normal methods to launch the browsers with the appropriate URLs.
