HTTP Basic Authentication with Anemone Web Spider - ruby

I need to collect every "title" from all pages of a site.
The site is protected by HTTP Basic Auth.
Without auth I would do the following:
require 'anemone'

Anemone.crawl("http://example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.doc.at('title').inner_html rescue nil
  end
end
But I have a problem with HTTP Basic Auth...
How can I collect the titles from a site behind HTTP Basic Auth?
If I try Anemone.crawl("http://username:password@example.com/"), I get only the first page's title; the other links are in the http://example.com/ style and I receive a 401 error for them.

HTTP Basic Auth works via HTTP headers. A client that wants to access a restricted resource must provide an authentication header, like this one:
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
It contains the username and password, Base64-encoded. More info is in the Wikipedia article on Basic Access Authentication.
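For illustration, here is how that header value can be produced in Ruby (a minimal sketch; the "Aladdin:open sesame" credentials are the classic example pair from the spec):

require 'base64'

# Base64-encode "username:password" to get the Basic auth token
credentials = Base64.strict_encode64('Aladdin:open sesame')
puts "Authorization: Basic #{credentials}"
# => Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==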
I googled a little bit and didn't find a way to make Anemone accept custom request headers. Maybe you'll have more luck.
But I found another crawler that claims it can do it: Messie. Maybe you should give it a try.
Update
Here's the place where Anemone sets its request headers: Anemone::HTTP. Indeed, there's no customization there. You can monkeypatch it. Something like this should work (put this somewhere in your app):
module Anemone
  class HTTP
    def get_response(url, referer = nil)
      full_path = url.query.nil? ? url.path : "#{url.path}?#{url.query}"

      opts = {}
      opts['User-Agent'] = user_agent if user_agent
      opts['Referer'] = referer.to_s if referer
      opts['Cookie'] = @cookie_store.to_s unless @cookie_store.empty? || (!accept_cookies? && @opts[:cookies].nil?)

      retries = 0
      begin
        start = Time.now()
        # format request
        req = Net::HTTP::Get.new(full_path, opts)
        # HTTP Basic authentication (must be set before the request is sent)
        req.basic_auth 'your username', 'your password' # <<== tweak here
        response = connection(url).request(req)
        finish = Time.now()
        response_time = ((finish - start) * 1000).round
        @cookie_store.merge!(response['Set-Cookie']) if accept_cookies?
        return response, response_time
      rescue Timeout::Error, Net::HTTPBadResponse, EOFError => e
        puts e.inspect if verbose?
        refresh_connection(url)
        retries += 1
        retry unless retries > 3
      end
    end
  end
end
Obviously, you should provide your own values for the username and password params of the basic_auth call. It's quick, dirty, and hardcoded, yes. But sometimes you don't have time to do things in a proper manner. :)

Related

Net::HTTP Proxy list

I understand that you can use a proxy with Ruby's Net::HTTP. However, I have no idea how to do this with a bunch of proxies. I need Net::HTTP to switch to another proxy and send another POST request after every POST request. Also, is it possible to make Net::HTTP switch to another proxy if the previous proxy is not working? If so, how?
Here is the code I'm trying to implement this in:
require 'net/http'

sleep(8)
http = Net::HTTP.new('URLHERE', 80)
http.read_timeout = 5000
http.use_ssl = false
path = 'PATHHERE'
data = '(DATAHERE)'
headers = {
  'Referer' => 'REFERER HERE',
  'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8',
  'User-Agent' => '(USERAGENTHERE)'
}
resp, data = http.post(path, data, headers)

# Output on the screen -> we should get either a 302 redirect (after a successful login) or an error page
puts 'Code = ' + resp.code
puts 'Message = ' + resp.message
resp.each {|key, val| puts key + ' = ' + val}
puts data
Given an array of proxies, the following example will make a request through each proxy in the array until it receives a "302 Found" response. (This isn't actually a working example because Google doesn't accept POST requests, but it should work if you insert your own destination and working proxies.)
require 'net/http'

destination = URI.parse "http://www.google.com/search"
proxies = [
  "http://proxy-example-1.net:8080",
  "http://proxy-example-2.net:8080",
  "http://proxy-example-3.net:8080"
]

# Create your POST request_object once
request_object = Net::HTTP::Post.new(destination.request_uri)
request_object.set_form_data({"q" => "stack overflow"})

proxies.each do |raw_proxy|
  proxy = URI.parse raw_proxy

  # Create a new http_object for each new proxy
  http_object = Net::HTTP.new(destination.host, destination.port, proxy.host, proxy.port)

  # Make the request
  response = http_object.request(request_object)

  # If we get a 302, report it and break
  if response.code == "302"
    puts "#{proxy.host}:#{proxy.port} responded with #{response.code} #{response.message}"
    break
  end
end
You should also probably do some error checking with begin ... rescue ... end each time you make a request. If you don't do any error checking and a proxy is down, control will never reach the line that checks for response.code == "302" -- the program will just fail with some type of connection timeout error.
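For example, you could wrap the request inside the loop above like this (a sketch only, reusing the destination and request_object from the example; the rescued exception classes cover the usual connection failures and you may want to adjust them):

proxies.each do |raw_proxy|
  proxy = URI.parse raw_proxy
  http_object = Net::HTTP.new(destination.host, destination.port, proxy.host, proxy.port)

  begin
    response = http_object.request(request_object)
  rescue Timeout::Error, Errno::ECONNREFUSED, Errno::ECONNRESET, SocketError => e
    # This proxy is down or unreachable -- report it and move on to the next one
    puts "#{proxy.host}:#{proxy.port} failed: #{e.class}"
    next
  end

  if response.code == "302"
    puts "#{proxy.host}:#{proxy.port} responded with #{response.code} #{response.message}"
    break
  end
end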
See the Net::HTTPHeader docs for other methods that can be used to customize the Net::HTTP::Post object.

Ruby HTTP post with session cookie

I'm trying to write a Ruby script that uses the API of the image gallery site Piwigo. This requires you to log in first with one HTTP POST and then upload an image with another POST.
This is what I've got so far, but it doesn't work and just returns a 401 error. Can anyone see where I am going wrong?
require 'net/http'
require 'pp'

http = Net::HTTP.new('mydomain.com', 80)
path = '/piwigo/ws.php'
data = 'method=pwg.session.login&username=admin&password=password'

resp, data = http.post(path, data, {})
if (resp.code == '200')
  cookie = resp.response['set-cookie']
  data = 'method=pwg.images.addSimple&image=image.jpg&category=7'
  headers = { "Cookie" => cookie }
  resp, data = http.post(path, data, headers)
  puts resp.code
  puts resp.message
end
Which gives this response when run:
$ ruby piwigo.rb
401
Unauthorized
There is a Perl example on their API page which I was trying to convert to Ruby: http://piwigo.org/doc/doku.php?id=dev:webapi:pwg.images.addsimple
You can use the nice_http gem: https://github.com/MarioRuiz/nice_http
NiceHttp will take care of your cookies, so you don't have to do anything:
require 'nice_http'

path = '/piwigo/ws.php'
data = '?method=pwg.session.login&username=admin&password=password'

http = NiceHttp.new('http://example.com')

resp = http.get(path + data)
if resp.code == 200
  resp = http.post(path)
  puts resp.code
  puts resp.message
end
Also, if you want, you can add your own cookies by using http.cookies.
You can use a gem called mechanize. It handles cookies transparently.
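A rough sketch of the same Piwigo login/upload flow with mechanize (untested; the host and form parameters are taken from the question, and mechanize carries the session cookie between the two requests automatically):

require 'mechanize'

agent = Mechanize.new

# Log in; mechanize stores the session cookie from the response in its cookie jar
agent.post('http://mydomain.com/piwigo/ws.php',
           'method'   => 'pwg.session.login',
           'username' => 'admin',
           'password' => 'password')

# Second request is sent with the stored cookie
response = agent.post('http://mydomain.com/piwigo/ws.php',
                      'method'   => 'pwg.images.addSimple',
                      'image'    => 'image.jpg',
                      'category' => '7')
puts response.code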

User-Agent in HTTP requests, Ruby

I'm pretty new to Ruby. I've tried looking over the online documentation, but I haven't found anything that quite works. I'd like to include a User-Agent in the following HTTP requests, both get_response() and get(). Can someone point me in the right direction?
# Preliminary check that Proggit is up
check = Net::HTTP.get_response(URI.parse(proggit_url))
if check.code != "200"
  puts "Error contacting Proggit"
  return
end

# Attempt to get the json
response = Net::HTTP.get(URI.parse(proggit_url))
if response.nil?
  puts "Bad response when fetching Proggit json"
  return
end
Amir F is correct, that you may enjoy using another HTTP client like RestClient or Faraday, but if you wanted to stick with the standard Ruby library you could set your user agent like this:
url = URI.parse(proggit_url)
req = Net::HTTP::Get.new(proggit_url)
req.add_field('User-Agent', 'My User Agent Dawg')
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
res.body
Net::HTTP is very low-level; I would recommend using the rest-client gem. It also follows redirects automatically and will be easier for you to work with, e.g.:
require 'rest_client'

response = RestClient.get proggit_url
if response.code != 200
  # do something
end
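If you go the rest-client route, a custom User-Agent can be passed as a headers hash (a minimal sketch, assuming proggit_url is defined as in the question):

require 'rest_client'

# Any header can be supplied as the second argument to RestClient.get
response = RestClient.get(proggit_url, 'User-Agent' => 'My User Agent Dawg')
puts response.code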

Ruby Net::HTTP - following 301 redirects

My users submit urls (to mixes on mixcloud.com) and my app uses them to perform web requests.
A good url returns a 200 status code:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix/")
request = Net::HTTP.get_response(uri)
#<Net::HTTPOK 200 OK readbody=true>
But if you forget the trailing slash then our otherwise good url returns a 301:
uri = "http://www.mixcloud.com/ErolAlkan/hard-summer-mix"
#<Net::HTTPMovedPermanently 301 MOVED PERMANENTLY readbody=true>
The same thing happens with 404's:
# bad path returns a 404
"http://www.mixcloud.com/bad/path/"
# bad path minus trailing slash returns a 301
"http://www.mixcloud.com/bad/path"
How can I 'drill down' into the 301 to see if it takes us on to a valid resource or an error page?
Is there a tool that provides a comprehensive overview of the rules that a particular domain might apply to their urls?
301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more frequently than you'd think; you just don't normally notice them while browsing because the browser follows them automatically for you.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
response = open('http://xyz...').read
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:
Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
  r = Net::HTTP.get_response(uri)
  if r.code == "301"
    r = Net::HTTP.get_response(URI.parse(r['location']))
  end
  r
end
If you want to be even smarter, you could try to add or remove a trailing slash from the URL when you get a 404 response. You could do that by creating a method like get_response_smart which handles this URL fiddling in addition to the redirects.
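A rough sketch of such a get_response_smart (only an illustration of the idea; it follows a single 301 and, on a 404, toggles the trailing slash and retries once):

require 'net/http'
require 'uri'

def get_response_smart(url)
  uri = URI.parse(url)
  r = Net::HTTP.get_response(uri)

  # Follow a single permanent redirect
  r = Net::HTTP.get_response(URI.parse(r['location'])) if r.code == "301"

  # On a 404, toggle the trailing slash and try once more
  if r.code == "404"
    retry_url = url.end_with?("/") ? url.chomp("/") : url + "/"
    r = Net::HTTP.get_response(URI.parse(retry_url))
  end

  r
end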
I can't figure out how to comment on the accepted answer (this question might be closed), but I should note that r.header is now obsolete, so r.header['location'] should be replaced by r['location'] (per https://stackoverflow.com/a/6934503/1084675 )
rest-client follows redirections for GET and HEAD requests without any additional configuration. It works very nicely.
for result codes between 200 and 207, a RestClient::Response will be returned
for result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD
for result code 303, the redirection will be followed and the request transformed into a GET
example of usage:
require 'rest-client'
RestClient.get 'http://example.com/resource'
The rest-client README also gives an example of following redirects with POST requests:
begin
  RestClient.post('http://example.com/redirect', 'body')
rescue RestClient::MovedPermanently,
       RestClient::Found,
       RestClient::TemporaryRedirect => err
  err.response.follow_redirection
end
Here is the code I came up with (derived from different examples) which will bail out if there are too many redirects (note that ensure_success is optional):
require "net/http"
require "uri"
class Net::HTTPResponse
def ensure_success
unless kind_of? Net::HTTPSuccess
warn "Request failed with HTTP #{#code}"
each_header do |h,v|
warn "#{h} => #{v}"
end
abort
end
end
end
def do_request(uri_string)
response = nil
tries = 0
loop do
uri = URI.parse(uri_string)
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
uri_string = response['location'] if response['location']
unless response.kind_of? Net::HTTPRedirection
response.ensure_success
break
end
if tries == 10
puts "Timing out after 10 tries"
break
end
tries += 1
end
response
end
Not sure if anyone is looking for this exact solution, but if you are trying to download an image over HTTP/HTTPS (following redirects) and store it in a variable:
require 'open_uri_redirections'
require 'net/https'
web_contents = open('file_url_goes_here', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE, :allow_redirections => :all) {|f| f.read }
puts web_contents

How to implement cookie support in ruby net/http?

I'd like to add cookie support to a Ruby class that uses net/http to browse the web. Cookies have to be stored in a file so that they survive after the script has ended. Of course I can read the specs and write some kind of handler, use some cookies.txt format and so on, but that seems like reinventing the wheel. Is there a better way to accomplish this task? Maybe some kind of cookie jar class to take care of the cookies?
The accepted answer will not work if your server returns and expects multiple cookies. This could happen, for example, if the server returns a set of FedAuth[n] cookies. If this affects you, you might want to look into using something along the lines of the following instead:
http = Net::HTTP.new('example.com', 443)
http.use_ssl = true
path1 = '/index.html'
path2 = '/index2.html'

# make a request to get the server's cookies
response = http.get(path1)
if (response.code == '200')
  all_cookies = response.get_fields('set-cookie')
  cookies_array = Array.new
  all_cookies.each { | cookie |
    cookies_array.push(cookie.split('; ')[0])
  }
  cookies = cookies_array.join('; ')

  # now make a request using the cookies
  response = http.get(path2, { 'Cookie' => cookies })
end
Taken from DZone Snippets
http = Net::HTTP.new('profil.wp.pl', 443)
http.use_ssl = true
path = '/login.html'

# GET request -> so the host can set his cookies
resp, data = http.get(path, nil)
cookie = resp.response['set-cookie'].split('; ')[0]

# POST request -> logging in
data = 'serwis=wp.pl&url=profil.html&tryLogin=1&countTest=1&logowaniessl=1&login_username=blah&login_password=blah'
headers = {
  'Cookie' => cookie,
  'Referer' => 'http://profil.wp.pl/login.html',
  'Content-Type' => 'application/x-www-form-urlencoded'
}
resp, data = http.post(path, data, headers)

# Output on the screen -> we should get either a 302 redirect (after a successful login) or an error page
puts 'Code = ' + resp.code
puts 'Message = ' + resp.message
resp.each {|key, val| puts key + ' = ' + val}
puts data
Update
# To save the cookies, you can use PStore
require 'pstore'

cookies = PStore.new("cookies.pstore")

# Save the cookie
cookies.transaction do
  cookies[:some_identifier] = cookie
end

# Retrieve the cookie back
cookies.transaction do
  cookie = cookies[:some_identifier]
end
The accepted answer does not work. You need to access the internal representation of the response header, where the multiple set-cookie values are stored separately, then remove everything after the first semicolon from each of these strings and join them together. Here is code that works:
r = http.get(path)
cookie = { 'Cookie' => r.to_hash['set-cookie'].collect { |ea| ea[/^.*?;/] }.join }
r = http.get(next_path, cookie)
Use http-cookie, which implements RFC-compliant parsing and rendering, plus a jar.
A crude example that happens to follow a redirect post-login:
require 'uri'
require 'net/http'
require 'http-cookie'

uri = URI('...')
jar = HTTP::CookieJar.new

Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  req = Net::HTTP::Post.new uri
  req.form_data = { ... }
  res = http.request req

  res.get_fields('Set-Cookie').each do |value|
    jar.parse(value, req.uri)
  end

  fail unless res.code == '302'

  req = Net::HTTP::Get.new(uri + res['Location'])
  req['Cookie'] = HTTP::Cookie.cookie_value(jar.cookies(uri))
  res = http.request req
end
Why do this? Because the answers above are woefully insufficient and flat out don't work in many RFC-compliant scenarios (it happened to me), so relying on a library that implements exactly what's needed is far more robust if you want to handle more than one particular case.
I've used Curb and Mechanize for a similar project.
Just enable cookies support and save the cookies to a temp cookiejar...
If you're using net/http or a package without cookie support built in, you will need to write your own cookie handling.
You can send and receive cookies using headers.
You can store the header in any persistence framework, whether that's some sort of database or files. A minimal sketch of that manual approach follows.
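This sketch assumes a single Set-Cookie header and uses a plain file for persistence; the host and paths are placeholders:

require 'net/http'

COOKIE_FILE = 'cookie.txt'

http = Net::HTTP.new('example.com', 80)

# First request: capture the cookie the server sets and persist it to disk
resp = http.get('/login')
cookie = resp['set-cookie'].split('; ').first
File.write(COOKIE_FILE, cookie)

# Later (even in another run of the script): read it back and send it along
cookie = File.read(COOKIE_FILE)
resp = http.get('/protected', { 'Cookie' => cookie })
puts resp.code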
