Using different Proxies per page with Puppeteer

Set Proxy per page in a Puppeteer Browser
I'm using a for...of loop to create a new page for each automated instance, but after both pages load and take a screenshot, whichever instance starts automating first takes over and only that automation runs.
From what I've seen, setting proxy flags is only possible when creating a new browser, e.g.:
const browser = await puppeteer.launch({args:['--proxy-server=ip:port']});
I can't seem to find any docs about setting a proxy via the page.

I made a module that does this. It's called puppeteer-page-proxy.
It supports setting a proxy for an entire page, or if you like, it can set a different proxy for each request.
First install it:
npm i puppeteer-page-proxy
Then require it:
const useProxy = require('puppeteer-page-proxy');
Using it is easy.
Set proxy for an entire page:
await useProxy(page, 'http://127.0.0.1:8000');
If you want a different proxy for each request, you can simply do this:
await page.setRequestInterception(true);
page.on('request', req => {
    useProxy(req, 'socks5://127.0.0.1:9000');
});
Then if you want to be sure that your page's IP has changed, you can look it up:
const data = await useProxy.lookup(page);
console.log(data.ip);
It supports http, https, socks4 and socks5 proxies, and it also supports authentication if that is needed:
const proxy = 'http://login:pass@127.0.0.1:8000';
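Putting those pieces together, a minimal sketch of driving two pages through different proxies from a single browser (the proxy addresses are placeholders):
const puppeteer = require('puppeteer');
const useProxy = require('puppeteer-page-proxy');

(async () => {
  const browser = await puppeteer.launch();

  // Each page gets its own proxy, so both automations can run side by side.
  const pageA = await browser.newPage();
  await useProxy(pageA, 'http://127.0.0.1:8000');   // placeholder HTTP proxy

  const pageB = await browser.newPage();
  await useProxy(pageB, 'socks5://127.0.0.1:9000'); // placeholder SOCKS5 proxy

  await Promise.all([
    pageA.goto('https://example.com'),
    pageB.goto('https://example.com'),
  ]);

  await browser.close();
})();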
Repository:
https://github.com/Cuadrix/puppeteer-page-proxy

Related

Render different website based on http header user agent

I have created two versions of the website, one for desktop and another for mobile. When users point their browser to www.example.com, I want my server to serve them a different website based on the HTTP User-Agent.
I don't want to use responsive design because my design, page layout, and content are quite different between desktop and mobile. Furthermore, we may want to experiment with search crawlers by having another rule that serves a separate plain HTML website.
Can I configure such a rule in my web server, or on Cloudflare?
You can detect the user's device by checking the User-Agent HTTP header for first-time visitors, or a cookie for returning visitors, then use a Cloudflare Worker script that acts as a reverse proxy, forwarding requests to either the desktop or the mobile version of the website/app.
import isMobile from "ismobilejs";

export default {
  fetch(req) {
    const device = isMobile(req.headers.get("user-agent"));
    // TODO: Also check cookies (for returning visitors)
    const { pathname, search } = new URL(req.url);
    const targetUrl = device.phone
      ? `https://m.example.com${pathname}${search}`
      : `https://example.com${pathname}${search}`;
    return fetch(targetUrl, req);
  }
};
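One possible shape for the cookie check left as a TODO above, as a sketch only; the "device" cookie name and its values are assumptions:
import isMobile from "ismobilejs";

// Prefer a previously set "device" cookie for returning visitors,
// fall back to User-Agent detection for first-time visitors.
function detectDevice(req) {
  const cookie = req.headers.get("cookie") || "";
  const match = cookie.match(/(?:^|;\s*)device=(mobile|desktop)/);
  if (match) return match[1];
  return isMobile(req.headers.get("user-agent")).phone ? "mobile" : "desktop";
}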
References
https://github.com/kaimallea/isMobile — User-Agent parser
https://github.com/kriasoft/cloudflare-starter-kit — Cloudflare Workers Starter Kit

Create a multi-website proxy with `http-proxy`

I'm using node-http-proxy to run a proxy website. I would like to proxy any target website that the user chooses, similarly to what's done by https://www.proxysite.com/, https://www.croxyproxy.com/ or https://hide.me/en/proxy.
How would one achieve this with node-http-proxy?
Idea #1: use a ?target= query param.
My first naive idea was to add a query param to the proxy, so that the proxy can read it and redirect.
Code-wise, it would more or less look like this (assuming we deploy it to https://myproxy.com):
import type { NextApiRequest, NextApiResponse } from 'next';
import httpProxyMiddleware from 'next-http-proxy-middleware'; // assumed import for the proxy middleware used below

const BASE_URL = 'https://myproxy.com';

// handler is the unique handler of all routes.
async function handler(
  req: NextApiRequest,
  res: NextApiResponse
): Promise<void> {
  try {
    const url = new URL(req.url, BASE_URL); // For example: `https://myproxy.com?target=https://google.com`
    const targetURLStr = url.searchParams.get('target'); // Get `?target=` query param.
    return httpProxyMiddleware(req, res, {
      changeOrigin: true,
      target: targetURLStr,
    });
  } catch (err) {
    res.status(500).json({ error: (err as Error).message });
  }
}
Problem: If I deploy this code to myproxy.com, and load https://myproxy.com?target=https://google.com, then google.com is loaded, but:
if I click a link to Google Images, it loads https://myproxy.com/images instead of https://myproxy.com?target=https://google.com/images (see also: URL as query param in proxy, how to navigate?)
Idea #2: use cookies
Second idea is to read the ?target= query param like above, store its hostname in a cookie, and proxy all resources to the cookie's hostname.
So for example user wants to access https://google.com/a/b?c=d via the proxy. The flow is:
go to https://myproxy.com?target=${encodeURIComponent('https://google.com/a/b?c=d')}
proxy reads the ?target= query param and stores the target origin (https://google.com) in a cookie
proxy redirects to https://myproxy.com/a/b?c=d (307 redirect)
proxy sees a new request, and since the cookie is set, it proxies this request through node-http-proxy using the cookie's target.
Code-wise, it would look like: https://gist.github.com/throwaway34241/de8a623c1925ce0acd9d75ff10746275
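For context, here is a condensed, hypothetical sketch of that flow (not the gist itself; it reuses the Next.js API route shape and httpProxyMiddleware from Idea #1, and the proxy-target cookie name is an assumption):
import type { NextApiRequest, NextApiResponse } from 'next';
import httpProxyMiddleware from 'next-http-proxy-middleware'; // assumed middleware, as above

const BASE_URL = 'https://myproxy.com';

async function handler(req: NextApiRequest, res: NextApiResponse): Promise<void> {
  const url = new URL(req.url ?? '/', BASE_URL);
  const target = url.searchParams.get('target');

  if (target) {
    // Steps 1-3: remember the target origin in a cookie, then 307-redirect to the same path.
    const targetURL = new URL(target);
    res.setHeader('Set-Cookie', `proxy-target=${encodeURIComponent(targetURL.origin)}; Path=/`);
    res.redirect(307, `${targetURL.pathname}${targetURL.search}`);
    return;
  }

  // Step 4: proxy every other request to the origin stored in the cookie.
  const origin = req.cookies['proxy-target'];
  if (!origin) {
    res.status(400).json({ error: 'No ?target= query param and no proxy-target cookie' });
    return;
  }
  return httpProxyMiddleware(req, res, { changeOrigin: true, target: decodeURIComponent(origin) });
}

export default handler;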
Problem: This works very well. But only for one proxy at a time. If I open one browser tab with https://myproxy.com?target=https://google.com, and another tab with https://myproxy.com?target=https://facebook.com, then:
first it'll set the cookie to https://google.com, and I can navigate in the 1st tab correctly
then I go to the 2nd tab (without closing the 1st one), it'll set the cookie to https://facebook.com, and I can navigate facebook on the 2nd tab correctly
but then if I go back to the first tab, it'll proxy google resources through facebook, because the cookie has been overwritten.
I'm a bit out of ideas, and am wondering how those generic proxy websites are doing. Ideally, I would not want to parse the HTML of the target website.
The idea of a proxy is to intercept the client's requests (whether by port or via a backend API), extract the URLs of the requested resources, modify them, make those requests itself, and then modify the responses and send them back to the client.
Your first approach does all of this except modifying the responses before sending them back.
One way to do this is to rewrite all links in the resources returned by the proxy so that they point back to your own address, and only then send them as responses to the client.
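As a rough illustration of that link-rewriting idea with node-http-proxy (a sketch only: it assumes the proxy lives at https://myproxy.com, ignores compressed and binary responses, and the naive regex will miss many cases, especially JavaScript-generated URLs):
const http = require('http');
const httpProxy = require('http-proxy');

// selfHandleResponse lets us consume the target's response and send a modified one instead.
const proxy = httpProxy.createProxyServer({ changeOrigin: true, selfHandleResponse: true });

proxy.on('proxyRes', (proxyRes, req, res) => {
  const chunks = [];
  proxyRes.on('data', (chunk) => chunks.push(chunk));
  proxyRes.on('end', () => {
    let body = Buffer.concat(chunks).toString('utf8');
    // Rewrite absolute URLs so that following them keeps going through the proxy.
    body = body.replace(/https?:\/\/[^"'\s<>]+/g,
      (link) => `https://myproxy.com?target=${encodeURIComponent(link)}`);
    res.writeHead(proxyRes.statusCode, { 'content-type': proxyRes.headers['content-type'] || 'text/html' });
    res.end(body);
  });
});

http.createServer((req, res) => {
  const target = new URL(req.url, 'https://myproxy.com').searchParams.get('target');
  if (!target) {
    res.writeHead(400);
    return res.end('Missing ?target= query param');
  }
  proxy.web(req, res, { target });
}).listen(8000);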
Another way is to wrap the target site in a frame, as most web proxy sites do, and use a script that crawls the page and replaces all links.
There is a small problem, though: JavaScript-based requests are mostly hardcoded in scripts, and it is not an easy job to replace them.
Your second approach sounds as if it would work better, but that's just a hunch; I can't say anything concrete. Implement a tab-activity checker so you can switch the cookie to the active tab; see the how-to-tell-if-browser-tab-is-active discussion for that.

Swagger page being redirected from https to http

An AWS Elastic Load Balancer listens on HTTPS (443) using SSL and forwards requests to EC2 instances over HTTP (80). IIS hosts a .NET Web API application that uses Swashbuckle to describe the API methods.
The home page of the API (https://example.com) has a link to the Swagger documentation, which reads https://example.com/swagger/ui/index.html when you hover over the link.
If I click the link, the browser is redirected to http://example.com/swagger/ui/index.html, which displays a Page Not Found error,
but if I type https://example.com/swagger/ui/index.html directly into the browser, the Swagger page loads. However, when expanding the methods and clicking "Try it out", the Request URL starts with "http" again.
This configuration is only for Stage and Production environments. Lower environments don't use the load balancer and just use http.
Any ideas on how to stop HTTPS being redirected to HTTP? And how to make Swagger display Request URLs using HTTPS?
Thank you
EDIT:
I'm using a custom index.html file
It seems this is a known issue with Swashbuckle. Quote:
"By default, the service root url is inferred from the request used to access the docs. However, there may be situations (e.g. proxy and load-balanced environments) where this does not resolve correctly. You can workaround this by providing your own code to determine the root URL."
What I did was provide the root URL and/or scheme to use based on the environment:
GlobalConfiguration.Configuration
    .EnableSwagger(c =>
    {
        ...
        c.RootUrl(req => GetRootUrlFromAppConfig(req));
        ...
        c.Schemes(GetEnvironmentScheme());
        ...
    })
    .EnableSwaggerUi(c =>
    {
        ...
    });
where
public static string[] GetEnvironmentScheme()
{
    ...
}

public static string GetRootUrlFromAppConfig(HttpRequestMessage request)
{
    ...
}
The way I would probably do it is to have one main file, and generate a different swagger file during your application's build based on the environment parameters for schemes and hosts.
That way, you only have to manage one swagger file across your environments, plus a few extra environment properties for host and schemes (if you don't already have them).
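A minimal Node sketch of that build-time step (the swagger.base.json file name, the TARGET_ENV variable, and the Swagger 2.0 top-level host/schemes fields are all assumptions):
const fs = require('fs');

// Hypothetical build step: stamp per-environment host/schemes into a base Swagger 2.0 file.
const env = process.env.TARGET_ENV || 'development'; // assumed environment variable
const perEnv = {
  development: { host: 'localhost', schemes: ['http'] },
  production: { host: 'example.com', schemes: ['https'] },
};

const base = JSON.parse(fs.readFileSync('swagger.base.json', 'utf8')); // assumed base file name
const output = { ...base, ...perEnv[env] };

fs.writeFileSync(`swagger.${env}.json`, JSON.stringify(output, null, 2));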
Since I don't know Swashbuckle, I can't answer your first question (the redirect) with certainty.

How to use multiple proxy when crawling with scrapy + splash?

We crawl with scrapy + splash and we want to use multiple proxies. But a Splash proxy profile only supports a single proxy: https://splash.readthedocs.io/en/stable/api.html#proxy-profiles.
[proxy]
; required
host=proxy.crawlera.com
port=8010
; optional, default is no auth
username=username
password=password
; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP
How to use multiple proxy when crawling with scrapy + splash?
There are several options:
use multiple profiles (as Rafael Almeida suggested in a comment);
pass a different proxy URL with each request (see http://splash.readthedocs.io/en/stable/api.html#arg-proxy);
write a Splash Lua script and use request:set_proxy in the splash:on_request callback - there is an example in the docs. This way you can set a different proxy for different requests initiated by a page, not just a single proxy per rendered page. I'm not aware of a way to do that in other browser automation tools like PhantomJS or Selenium.

chrome.webRequest.onHeadersReceived.addListener in firefox

I'm writing an extension for Firefox and I want to intercept calls to a URL; how can I capture the HTTP request in Firefox when a call to a URL is made?
For example, in Google Chrome there is the event: chrome.webRequest.onHeadersReceived.addListener(...)
Use plain XMLHttpRequest, which, when run from a chrome-privileged (system principal) place, allows access to all resources without obeying the same-origin policy, just as the SDK request module does not obey it.
SDK: in a lib/ module get it via
const {XMLHttpRequest} = require("sdk/net/xhr");
XUL overlays/windows, ChromeWorker: There already is a global XMLHttpRequest constructor.
JS code modules, etc:
Components.classes["@mozilla.org/xmlextras/xmlhttprequest;1"]
    .createInstance(Components.interfaces.nsIXMLHttpRequest);
From there you can use onreadystatechange to look for a .readyState of HEADERS_RECEIVED. See the XMLHttpRequest docs.
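A minimal sketch of that (assuming a privileged context where one of the constructors above is available; the URL is a placeholder):
const xhr = new XMLHttpRequest();
xhr.open("GET", "https://example.com/resource");
xhr.onreadystatechange = function () {
  // readyState 2 (HEADERS_RECEIVED): the status line and response headers have arrived.
  if (xhr.readyState === XMLHttpRequest.HEADERS_RECEIVED) {
    console.log(xhr.status, xhr.getAllResponseHeaders());
  }
};
xhr.send();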
To get cookies working for users with Deny Third-Party-Cookies you'll need to use forceAllowThirdPartyCookie in the SDK or otherwise:
if (xhr_instance.channel instanceof Components.interfaces.nsIHttpChannelInternal)
    xhr_instance.channel.forceAllowThirdPartyCookie = true;
