Puppeteer: random user-agent in launch args

Recently I asked about loading random user agents from a .json file, but after I added Puppeteer's screen capture the page still reports HeadlessChrome, because I copied the answer from the previous topic into the wrong place.
The user agent currently comes from this launch call:
const browser = await puppeteer.launch({
  headless: false,
  args: [
    '--headless', // note: contradicts headless: false above
    '--disable-infobars',
    '--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    '--no-sandbox',
    `--proxy-server=socks5://127.0.0.1:${port}`,
  ],
});
So how do I pick from a random list inside the arguments?
My previous question, whose answer didn't work for me because the random user agent code was in the wrong place, is here: Puppeteer browser useragent list
But adding that code inside this launch call won't work.
So after --user-agent= I want to call a "random" function, but how?

You can use the user-agents module. Install it first:
npm install user-agents
const UserAgent = require("user-agents");

const userAgent = new UserAgent({
  deviceCategory: "desktop",
  platform: "Linux x86_64",
});
Then, in the launch args:
'--user-agent=' + userAgent,
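Put together, a minimal sketch of the whole flow (example.com and the final log line are just for verification; the --proxy-server flag from the question can be added back unchanged):

const puppeteer = require('puppeteer');
const UserAgent = require('user-agents');

(async () => {
  // A fresh random desktop Linux user agent on every run
  const userAgent = new UserAgent({
    deviceCategory: 'desktop',
    platform: 'Linux x86_64',
  });

  const browser = await puppeteer.launch({
    headless: false,
    args: [
      '--disable-infobars',
      '--user-agent=' + userAgent.toString(),
      '--no-sandbox',
    ],
  });

  const page = await browser.newPage();
  await page.goto('https://example.com');
  // Should print the randomized string, not HeadlessChrome
  console.log(await page.evaluate(() => navigator.userAgent));
  await browser.close();
})();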

Related

Is there an update to open-uri that changes the way you call a User-Agent?

In the book "Instant Nokogiri" and on the Packt Hub Nokogiri page there is a User-Agent example for spoofing a browser while crawling the New York Times website for the top story.
I am working through this book; the code is a little dated, so I updated it.
My version of the code is:
require 'open-uri'
require 'nokogiri'
require 'sinatra'
browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4)
AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'
doc = Nokogiri::HTML(open ('http://nytimes.com', browser))
nyt_headline = doc.at_css('h2 span').content
nyt_url = "http://nytimes.com" + doc.at_css('.css-16ugw5f a')[:href]
html = "<h1>Nokogiri News Service</h1>"
html += "<h2>Top Story: #{nyt_headline}</h2>"
get '/' do
html
end
I am running this in a terminal session on macOS and getting this error:
invalid access mode Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1 (ArgumentError)
(URI::HTTP resource is read only.)
I don't believe I am attempting to 'write', so I'm not sure why a 'read only' error would block this from running. It was working before I added the User-Agent info.
See OpenURI's open documentation:
URI.open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
The options are a Hash. You're passing a String.
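Applied to the code in the question, a minimal sketch of the fix (same URL and browser string as above; URI.open is the modern spelling, and plain open also works on older Rubies):

require 'open-uri'
require 'nokogiri'

browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

# The second argument must be a Hash of options, not a bare String
doc = Nokogiri::HTML(URI.open('http://nytimes.com', 'User-Agent' => browser))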

HTTParty request returns 404 code

I'm sending an HTTP request with the HTTParty Ruby gem with the following code:
require 'httparty'
require 'pry'
page = HTTParty.get('http://www.cubuffs.com/')
binding.pry
You can verify that the URL is valid. When exploring the results with Pry, I get the following:
[1] pry(main)> page
=> nil
[2] pry(main)> page.code
=> 404
[3] pry(main)> page.response
=> #<Net::HTTPNotFound 404 Not Found readbody=true>
I'm pretty sure nothing is wrong with my code, because I can substitute other URLs and they work as expected. For some reason, URLs from this domain return a 404 code. Any ideas what is wrong here and how to fix it?
The owner of that site is checking the User-Agent header, and doesn't like the one HTTParty sends by default. You can get the page by including a user agent header from a browser; here is the one from Chrome:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Modify your code as follows:
require 'httparty'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
page = HTTParty.get('http://www.cubuffs.com/', headers: {"User-Agent": user_agent})
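A quick check that the header is what makes the difference (the 200 is what I'd expect, assuming the site still filters on User-Agent):

puts page.code # => 200 with the browser User-Agent, 404 without it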

bypass cloudflare protection with casperjs or phantomjs while using tor proxy

I use Tor as a SOCKS proxy for CasperJS.
My OS is Windows 10 x64.
My test.js:
var casper = require('casper').create({
  verbose: true,
  logLevel: 'error',
  pageSettings: {
    loadImages: false,  // The WebPage instance used by Casper will
    loadPlugins: false, // use these settings
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
  }
});

var caturl = 'http://www.test.com';

casper.start(caturl, function() {
  this.echo(this.getTitle());
});

casper.run();
Result from my local machine:
casperjs test.js
This Is Page Title
When Tor is open (I'm sure it's working fine, and the SOCKS proxy was tested before and works):
casperjs --proxy=127.0.0.1:9150 --proxy-type=socks5 test.js
Attention Required! | Cloudflare
As the result shows, Cloudflare wants a reCAPTCHA solved before it will open this site.
BUT
when I open the same link in the Tor Browser, it opens normally without asking for a reCAPTCHA.
WHY
does opening the link with CasperJS ask for a reCAPTCHA, while the Tor Browser (using the same proxy IP) doesn't?
Is this related to the user agent, or something else?

Ember isn't looking for my API

So I'm running an Ember-CLI and Rails 5 API-only app. It works fine in development when I use the --proxy http://localhost:3000 flag, but now I am trying to deploy to Heroku.
I have two sides: recipme-ember and recipme-rails. Feel free to explore the repos:
amclelland/fancy_recipme_frontend
amclelland/fancy_recipme
So after some struggling I have both sides deployed, but the Ember app refuses to talk to the Rails app.
When I try to go to the Meal index I see this in the Heroku logs:
"GET /meals HTTP/1.1" 200 711 "http://recipme-ember.herokuapp.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/50.0.2661.102 Chrome/50.0.2661.102 Safari/537.36"
Looks like the Ember app is trying to get the /meals data from itself.
I have set the API_URL env var for my Ember app to the Rails Heroku URL. I'm not sure if there's something else I need to set.
Thanks in advance!
You need to set the host property for your adapter:
// adapters/application.js
import DS from 'ember-data';
export default DS.RESTAdapter.extend({
  host: 'http://yourapi.herokuapp.com',
});
If you need different hosts for development and production, you can use your config file to change it, as sketched below.
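For example (a sketch; the apiHost key is a name I chose, not an Ember convention):

// config/environment.js
module.exports = function (environment) {
  const ENV = {
    // ...existing settings...
  };

  if (environment === 'development') {
    ENV.apiHost = 'http://localhost:3000';
  }
  if (environment === 'production') {
    ENV.apiHost = 'http://yourapi.herokuapp.com';
  }

  return ENV;
};

// adapters/application.js
import DS from 'ember-data';
import config from '../config/environment';

export default DS.RESTAdapter.extend({
  host: config.apiHost,
});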

download image from url using python urllib but receiving HTTP Error 403: Forbidden

I want to download an image file from a URL using the Python module urllib.request. It works for some websites (e.g. mangastream.com) but not for others (e.g. mangadoom.co), where it fails with "HTTP Error 403: Forbidden". What could be the problem in the latter case, and how do I fix it?
I am using Python 3.4 on OS X.
import urllib.request
# does not work
img_url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
At the end of the error message it says:
...
HTTPError: HTTP Error 403: Forbidden
However, it works for another website:
# work
img_url = 'http://img.mangastream.com/cdn/manga/51/3140/006.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
I have tried the solutions from the posts below, but none of them works on mangadoom.co.
Downloading a picture via urllib and python
How do I copy a remote image in python?
The solution here also does not fit, because my case is downloading an image.
urllib2.HTTPError: HTTP Error 403: Forbidden
A non-Python solution is also welcome. Any suggestion would be much appreciated.
This website is blocking the user agent used by urllib, so you need to change it in your request. Unfortunately, I don't think urlretrieve supports this directly.
I advise using the beautiful requests library; the code becomes:
import requests
import shutil

r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png', stream=True)
if r.status_code == 200:
    with open('img.png', 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
Note that this website does not seem to block the requests user agent, but if it needs to be modified, it is easy:
r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png',
                 stream=True, headers={'User-agent': 'Mozilla/5.0'})
Also relevant: changing user-agent in urllib
You can build an opener. Here's the example:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

url = ''    # the remote image URL goes here
local = ''  # the local file name goes here
urllib.request.urlretrieve(url, local)
By the way, the following two snippets are equivalent.
Without an opener:
req = urllib.request.Request(url, data, hdr)
html = urllib.request.urlopen(req)
With an opener built:
html = opener.open(url, data, timeout)
However, we are not able to add headers when we use urllib.request.urlretrieve(), so in this case we have to build and install an opener.
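For completeness, a sketch that skips urlretrieve and the global opener entirely by passing the header on a Request (same URL as in the question):

import urllib.request

url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# urlopen honors per-request headers, so no opener is needed
with urllib.request.urlopen(req) as resp, open('out_005.png', 'wb') as f:
    f.write(resp.read())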
I tried wget with the URL in a terminal and it works:
wget -O out_005.png http://mangadoom.co/wp-content/manga/5170/886/005.png
So my workaround is to use the script below, and it works too:
import os

out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
os.system("wget -O {0} {1}".format(out_image, url))
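If you go this route, a slightly safer variant uses subprocess instead of os.system, so the URL never passes through a shell (same command otherwise):

import subprocess

out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'

# Arguments are passed as a list, so no shell quoting is involved
subprocess.run(['wget', '-O', out_image, url], check=True)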
