Recently I asked about picking random user agents from a .json file, but after I added Puppeteer's screen capture it kept showing HeadlessChrome, because I had pasted the answer from the previous topic into the wrong place.
The user agent is actually set in this launch code:
const browser = await puppeteer.launch({
  headless: false,
  args: [
    '--headless',
    '--disable-infobars',
    '--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    '--no-sandbox',
    `--proxy-server=socks5://127.0.0.1:${port}`,
  ],
});
So how do I pick from a random list inside those arguments?
The earlier answer that didn't work for me (because the random user agent code ended up in the wrong place) is here: Puppeteer browser useragent list
But adding that code inside this launch call won't work.
So after --user-agent= I want to plug in a "random" function, but how?
You can use the user-agents module.
Install it first: npm install user-agents
const UserAgent = require("user-agents");

const userAgent = new UserAgent({
  deviceCategory: "desktop",
  platform: "Linux x86_64",
});
Then, in the args array, use "--user-agent=" + userAgent,
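Putting it together, here is a minimal sketch of the full launch call, assuming the same SOCKS proxy and port variable from your question (user-agents generates a fresh random UA each time it is instantiated):

const puppeteer = require('puppeteer');
const UserAgent = require('user-agents');

const port = 9050; // assumed: your proxy port variable

(async () => {
  // A new random desktop Linux user agent on every launch.
  const userAgent = new UserAgent({
    deviceCategory: 'desktop',
    platform: 'Linux x86_64',
  });

  const browser = await puppeteer.launch({
    headless: false,
    args: [
      '--disable-infobars',
      '--user-agent=' + userAgent, // userAgent.toString() yields the UA string
      '--no-sandbox',
      `--proxy-server=socks5://127.0.0.1:${port}`,
    ],
  });

  // ... your page logic ...
  await browser.close();
})();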
The book "Instant Nokogiri" and the Packt Hub Nokogiri page include a User-Agent example that spoofs a browser while crawling the New York Times website for the top story.
I am working through this book; the code is a little dated, so I updated it.
My version of the code is:
require 'open-uri'
require 'nokogiri'
require 'sinatra'
browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4)
AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'
doc = Nokogiri::HTML(open ('http://nytimes.com', browser))
nyt_headline = doc.at_css('h2 span').content
nyt_url = "http://nytimes.com" + doc.at_css('.css-16ugw5f a')[:href]
html = "<h1>Nokogiri News Service</h1>"
html += "<h2>Top Story: #{nyt_headline}</h2>"
get '/' do
  html
end
I am running this through a terminal session on Mac OS and getting this error:
invalid access mode Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) (ArgumentError)
AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1 (URI::HTTP resource is read only.)
I don't believe I am attempting to 'write'. Not sure why a 'read only' error would block this from running. It was working before I added the User Agent info.
See OpenURI's open documentation:
URI.open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
The options are a Hash. You're passing a String.
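Applied to your code, the corrected call would look something like this, a sketch assuming your existing browser string (note that on modern Ruby the method is URI.open rather than a bare open):

require 'open-uri'
require 'nokogiri'

browser = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/536.30.1 (KHTML, like Gecko) Version/6.0.5 Safari/536.30.1'

# The second argument must be a Hash of options, not a bare String.
doc = Nokogiri::HTML(URI.open('http://nytimes.com', 'User-Agent' => browser))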
I'm sending an HTTP request with the HTTParty Ruby gem with the following code:
require 'httparty'
require 'pry'
page = HTTParty.get('http://www.cubuffs.com/')
binding.pry
You can verify that the URL is valid. When exploring the results with Pry, I get the following:
[1] pry(main)> page
=> nil
[2] pry(main)> page.code
=> 404
[3] pry(main)> page.response
=> #<Net::HTTPNotFound 404 Not Found readbody=true>
I'm pretty sure nothing is wrong with my code, because I can substitute other URLs and they work as expected. For some reason, URLs from this domain return a 404 code. Any ideas what is wrong here and how to fix it?
The owner of that site is checking the User-Agent header and doesn't like the one HTTParty uses. You can get the page by including a browser's user agent header; here is the one from Chrome:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Modify your code as follows:
require 'httparty'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
page = HTTParty.get('http://www.cubuffs.com/', headers: {"User-Agent": user_agent})
I use Tor as a SOCKS proxy for CasperJS.
My OS is Windows 10 x64.
My test.js:
var casper = require('casper').create({
  verbose: true,
  logLevel: 'error',
  pageSettings: {
    loadImages: false,  // The WebPage instance used by Casper will
    loadPlugins: false, // use these settings
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
  }
});
var caturl = 'http://www.test.com';

casper.start(caturl, function() {
  this.echo(this.getTitle());
});

casper.run();
Result from my local machine:
casperjs test.js
This Is Page Title
When I run it through Tor (I'm sure it's working fine; I also tested the SOCKS proxy beforehand):
casperjs --proxy=127.0.0.1:9150 --proxy-type=socks5 test.js
Attention Required! | Cloudflare
As the result shows, Cloudflare wants me to solve a reCAPTCHA before opening this site.
BUT
when I open the Tor Browser and visit the same link I tested in CasperJS, it opens normally without asking for a reCAPTCHA.
WHY
does opening the link with CasperJS trigger a reCAPTCHA, while opening it with the Tor Browser (same proxy IP) doesn't?
Is this related to the user agent, or something else?
So I'm running an Ember-CLI and Rails 5 API-only app. It works fine in development when I use the --proxy http://localhost:3000 flag, but now I am trying to deploy to Heroku.
I have two sides: recipme-ember and recipme-rails. Feel free to explore the repos:
amclelland/fancy_recipme_frontend
amclelland/fancy_recipme
So after some struggling I have both sides deployed, but the Ember app refuses to talk to the Rails app.
When I try to go to the Meal index I see this in the Heroku logs:
"GET /meals HTTP/1.1" 200 711 "http://recipme-ember.herokuapp.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/50.0.2661.102 Chrome/50.0.2661.102 Safari/537.36"
Looks like the Ember app is trying to get the /meals data from itself.
I have set API_URL env var for my Ember app to the Rails Heroku URL. Not sure if there's something else I need to set.
Thanks in advance!
You need to set the host property for your adapter:
// adapters/application.js
import DS from 'ember-data';

export default DS.RESTAdapter.extend({
  host: 'http://yourapi.herokuapp.com',
});
If you need different hosts for development and production, you can use your config file to change it, as shown here.
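A minimal sketch of that approach, assuming a standard ember-cli config/environment.js (the apiHost key and the production URL are examples, not something your repo confirms):

// config/environment.js
module.exports = function (environment) {
  var ENV = {
    // ... existing settings ...
    apiHost: 'http://localhost:3000' // development default
  };

  if (environment === 'production') {
    ENV.apiHost = 'http://recipme-rails.herokuapp.com'; // assumed production API URL
  }

  return ENV;
};

// adapters/application.js
import DS from 'ember-data';
import config from '../config/environment';

export default DS.RESTAdapter.extend({
  host: config.apiHost
});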
I want to download an image file from a URL using the Python module urllib.request. It works for some websites (e.g. mangastream.com), but for another (mangadoom.co) it fails with "HTTP Error 403: Forbidden". What could be the problem in the latter case, and how do I fix it?
I am using Python 3.4 on OS X.
import urllib.request
# does not work
img_url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
At the end of the error message it says:
...
HTTPError: HTTP Error 403: Forbidden
However, it works for another website:
# work
img_url = 'http://img.mangastream.com/cdn/manga/51/3140/006.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
I have tried the solutions from the posts below, but none of them works on mangadoom.co:
Downloading a picture via urllib and python
How do I copy a remote image in python?
The solution here also doesn't fit, because my case is downloading an image:
urllib2.HTTPError: HTTP Error 403: Forbidden
A non-Python solution is also welcome. Any suggestion will be much appreciated.
This website is blocking the user agent used by urllib, so you need to change it in your request. Unfortunately, I don't think urlretrieve supports this directly.
I advise using the excellent requests library; the code becomes (adapted from here):
import requests
import shutil

r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png', stream=True)
if r.status_code == 200:
    with open('img.png', 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
Note that this website doesn't seem to block the requests user agent. But if it needs to be modified, that's easy:
r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png',
                 stream=True, headers={'User-Agent': 'Mozilla/5.0'})
Also relevant: changing user-agent in urllib
You can build an opener. Here's the example:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

url = ''
local = ''
urllib.request.urlretrieve(url, local)
By the way, the following two snippets are equivalent:
(without an opener)
req = urllib.request.Request(url, data, hdr)
html = urllib.request.urlopen(req)
(with a built opener)
html = opener.open(url, data, timeout)
However, we cannot add headers when we use:
urllib.request.urlretrieve()
So in this case, we have to build an opener.
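Alternatively, if you'd rather not install a global opener, a minimal sketch is to pass the header on a Request object and write the bytes out yourself (the output filename here is just an example):

import urllib.request

url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# Read the image bytes from the response and save them locally.
with urllib.request.urlopen(req) as resp, open('005.png', 'wb') as f:
    f.write(resp.read())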
I tried wget with the URL in a terminal, and it works:
wget -O out_005.png http://mangadoom.co/wp-content/manga/5170/886/005.png
So my workaround is to use the script below, and it works too.
import os

out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'

# Shell out to wget, which sends its own User-Agent header.
os.system("wget -O {0} {1}".format(out_image, url))
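If you stay with the wget approach, a sketch using subprocess avoids the shell and handles odd characters in the URL or filename more safely (subprocess.run needs Python 3.5+; on 3.4, subprocess.check_call works the same way):

import subprocess

out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'

# Invoke wget directly (no shell), raising if the download fails.
subprocess.run(['wget', '-O', out_image, url], check=True)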