Mechanize: Avoid webserver blocking - Ruby

I have this code:
#!/usr/bin/env ruby
# encoding: utf-8
require 'mechanize'

begin
  agent = Mechanize.new
  agent.robots = false
  agent.user_agent_alias = 'Mac Safari'
  url = "http://www.paris.cl/tienda/es/paris/computacion/tablet/tablet-acer--iconia-b1-710-l688-7-342750-ppp-"
  website = agent.get(url)
rescue Exception => e
  puts "Error : " + e.message
end
This tries to fetch a website, but I get this error:
Error : 403 => Net::HTTPForbidden for http://www.paris.cl/tienda/es/paris/computacion/tablet/tablet-acer--iconia-b1-710-l688-7-342750-ppp- -- unhandled response
The webserver blocks me before I can even get the page. I tried changing my IP, but nothing happened.
Is there any way to avoid this block? (I also don't know what kind of block this is.)
Greetings

Your code works for me. Apparently you hit their resource a few times as an unprotected client (without proper HTTP headers) and they've blocked your IP.
It happens to the best of us :)
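Once the block expires, it may help to look more like a regular browser and to pace your requests. Below is a minimal sketch, not a guaranteed workaround: user_agent_alias and request_headers are standard Mechanize settings, but the header values and the pause length are my own assumptions.
require 'mechanize'

# Sketch: browser-like headers plus a pause between requests. The header
# values and the 2-second pause are assumptions, not site requirements.
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
agent.request_headers = {
  'Accept'          => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'es-CL,es;q=0.8'
}

url = "http://www.paris.cl/tienda/es/paris/computacion/tablet/tablet-acer--iconia-b1-710-l688-7-342750-ppp-"

begin
  page = agent.get(url)
  puts page.title
rescue Mechanize::ResponseCodeError => e
  puts "Server refused the request: #{e.response_code}"
end

sleep 2 # pause here before fetching any further pages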

Related

Webcrawler skipping URLs

I'm writing a program that scans for vulnerable websites. I happen to know that a couple of sites have vulnerabilities and return a SQL syntax error; however, when I run the program, it skips over these sites and doesn't output that they were found or that they were saved into a file. This program is being used for pentesting, and all owners of the sites are aware of the vulnerability.
Source:
def get_urls
  info("Searching for possible SQL vulnerable sites.")
  @agent = Mechanize.new
  page = @agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = @agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls = str_list[1]
      next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
      urls_to_log = urls.gsub("%3F", '?').gsub("%3D", '=')
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/SQL_sites_to_check.txt", "a+") {|s| s.puts("#{urls_to_log}'")}
    end
  end
  info("Possible vulnerable sites dumped into #{PATH}/temp/SQL_sites_to_check.txt")
end
def check_if_vulnerable
  info("Checking if sites are vulnerable.")
  IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |parse|
    begin
      Timeout::timeout(5) do
        parsing = Nokogiri::HTML(RestClient.get("#{parse.chomp}"))
      end
    rescue Timeout::Error, RestClient::ResourceNotFound, RestClient::SSLCertificateNotVerified, Errno::ECONNABORTED, Mechanize::ResponseCodeError, RestClient::InternalServerError => e
      if e
        warn("URL: #{parse.chomp} failed with error: [#{e}] dumped to non_exploitable.txt")
        File.open("#{PATH}/lib/non_exploitable.txt", "a+"){|s| s.puts(parse)}
      else
        success("SQL syntax error discovered in URL: #{parse.chomp} dumped to SQL_VULN.txt")
        File.open("#{PATH}/lib/SQL_VULN.txt", "a+"){|vuln| vuln.puts(parse)}
      end
    end
  end
end
Example of usage:
[22:49:29 INFO]Checking if sites are vulnerable.
[22:49:53 WARNING]URL: http://www.police.bd/content.php?id=275' failed with error: [execution expired] dumped to non_exploitable.txt
File containing the URLs:
http://www.bible.com/subcat.php?id=2'
http://www.cidko.com/pro_con.php?id=3'
http://www.slavsandtat.com/about.php?id=25'
http://www.police.bd/content.php?id=275'
http://www.icdcprage.org/index.php?id=10'
http://huawei.com/en/plugin.php?id=hwdownload'
https://huawei.com/en/plugin.php?id=unlock'
https://facebook.com/profile.php?id'
http://www.footballclub.com.au/index.php?id=43'
http://www.mesrs.qc.ca/index.php?id=1525'
As you can see, the program skips over the first three URLs and goes straight to the fourth one. Why?
Am I doing something wrong that would cause this to happen?
I'm not sure that rescue block is where it should be. You are not doing anything with the content you fetch in parsing = Nokogiri::HTML(RestClient.get("#{parse.chomp}")), and for the first three URLs it may simply work, hence no exception and no error output. Add some output after that line to see them being fetched.
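For example, here is a minimal sketch of that suggestion, reusing the question's own helpers and constants (success/warn, PATH): a begin/rescue/else layout makes the successful fetches visible instead of silently discarding them.
require 'timeout'
require 'rest-client'
require 'nokogiri'

# Sketch only: same fetch loop as in the question, with output in an `else`
# branch so you can see which URLs actually come back without raising.
IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |line|
  url = line.chomp
  begin
    body = Timeout.timeout(5) { RestClient.get(url) }
    parsing = Nokogiri::HTML(body)
  rescue StandardError => e # or the specific exception list from the question
    warn("URL: #{url} failed with error: [#{e}] dumped to non_exploitable.txt")
    File.open("#{PATH}/lib/non_exploitable.txt", "a+") { |s| s.puts(url) }
  else
    success("Fetched #{url} (title: #{parsing.title.to_s.strip})")
  end
end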

Test existence of a page using Mechanize

I want to test whether a URL exists before downloading it.
I usually do this:
agent = Mechanize.new
page = agent.get("http://www.some_url.com/atributes")
but instead of that I want to test whether a page exists at that URL before downloading it.
The only way to see if a page exists (and that you can reach it via the internet) is to perform an actual request. You could first do an HTTP HEAD request, which only requests the headers, not the actual content:
url = "http://www.some_url.com/atributes"
agent = Mechanize.new

begin
  agent.head(url)
  page_exists = true
rescue Mechanize::ResponseCodeError, SocketError
  # 4xx/5xx responses raise Mechanize::ResponseCodeError; DNS failures raise SocketError
  page_exists = false
end

if page_exists
  page = agent.get(url)
  # do something with page ...
end
But then again, you can just skip the extra request and rescue from errors directly on the GET request:
url = "http://www.some_url.com/atributes"
agent = Mechanize.new

begin
  page = agent.get(url)
  # do something with page ...
rescue Mechanize::ResponseCodeError, SocketError
  puts "There is no such page."
end

How do I follow URL redirection?

I have a URL and I need to retrieve the URL it redirects to (the number of redirections is arbitrary).
One real example I'm working on is:
https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q
which will eventually redirect to:
http://company.zynga.com/privacy/policy
which is the URL I'm interested in.
I tried with open-uri as follows:
require 'open-uri'

privacy_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil

open(privacy_url) do |h|
  puts "Redirecting to #{h.base_uri}"
  final_url = h.base_uri
end
but I keep getting the original URL back, meaning that final_url is equal to privacy_url.
Is there any way to follow this kind of redirection and programmatically access the resulting URL?
I finally made it using the Mechanize gem. The key is to enable the follow_meta_refresh option, which is disabled by default.
Here's how:
require 'mechanize'

browser = Mechanize.new
browser.follow_meta_refresh = true

start_url = "https://www.google.com/url?q=http://m.zynga.com/about/privacy-center/privacy-policy&sa=D&usg=AFQjCNESJyXBeZenALhKWb52N1vHouAd5Q"
final_url = nil

browser.get(start_url) do |page|
  final_url = page.uri.to_s
end

puts final_url # => http://company.zynga.com/privacy/policy

Getting the contents of a 404 error page response ruby

I know some languages have a library that allows you to get the HTTP content for a 404 or 500 message.
Is there a library that allows that for Ruby?
I've tried open-uri, but it simply raises an HTTPError exception without the HTML content of the 404 response.
This doesn't seem to be stated clearly enough in the docs, but OpenURI::HTTPError has an io attribute, which you can treat as a read-only file as far as I know.
require 'open-uri'

begin
  response = open('http://google.com/blahblah')
rescue OpenURI::HTTPError => e
  puts e              # error message
  puts e.io.status    # HTTP status code
  puts e.io.readlines # HTTP response body
end
Net::HTTP supports what you need.
You can use the request_get method and it will return a response regardless of the status code.
From script/console:
> http = Net::HTTP.new('localhost', 3000)
=> #<Net::HTTP localhost:3000 open=false>
> resp = http.request_get('/foo') # a page that doesn't exist
=> #<Net::HTTPNotFound 404 Not Found readbody=true>
> resp.code
=> "404"
> resp.body
=> "<html>...</html>"
(If the library is not available to you by default, you can do a require 'net/http'.)
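For reference, here is the same idea as a standalone sketch against an arbitrary URL (the URL below is only an example): Net::HTTP.get_response returns the response object without raising, whatever the status code.
require 'net/http'

# Sketch: get_response returns the Net::HTTPResponse even for 4xx/5xx,
# so the code and body are always available.
uri  = URI('http://google.com/blahblah')
resp = Net::HTTP.get_response(uri)

puts resp.code # e.g. "404"
puts resp.body # the HTML of the error page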
This works with HTTParty as well: https://github.com/jnunemaker/httparty
require 'rubygems'
require 'httparty'
HTTParty.get("http://google.com/blahblah").parsed_response
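If you also want the status code along with the body, the HTTParty response object exposes it; a small sketch:
require 'httparty'

# Sketch: HTTParty does not raise on 404, so code and body can be read directly.
response = HTTParty.get("http://google.com/blahblah")
puts response.code # e.g. 404
puts response.body # the HTML of the error page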
There are a number of HTTP Clients available, choose one you like from https://www.ruby-toolbox.com/categories/http_clients

How to get redirect log in Mechanize?

In Ruby, if you use Mechanize to follow 301/302 redirects like this:
require 'mechanize'
m = WWW::Mechanize.new
m.get('http://google.com')
how do you get the list of pages Mechanize was redirected through? (Like http://google.com => http://www.google.com => http://google.com.ua)
OK, here is the code in Mechanize responsible for redirection:
elsif res_klass <= Net::HTTPRedirection
  return page unless follow_redirect?
  log.info("follow redirect to: #{ response['Location'] }") if log
  from_uri = page.uri
  raise RedirectLimitReachedError.new(page, redirects) if redirects + 1 > redirection_limit
  redirect_verb = options[:verb] == :head ? :head : :get
  page = fetch_page( :uri => response['Location'].to_s,
                     :referer => page,
                     :params => [],
                     :verb => redirect_verb,
                     :redirects => redirects + 1
                   )
  @history.push(page, from_uri)
  return page
but trying m.history.map { |p| puts p.uri } just shows the URI of the last page three times.
The key here is to take advantage of the built-in logging in Mechanize. Here's a full code sample using Ruby's standard Logger (writing to a Rails-style log/ directory).
require 'mechanize'
require 'logger'
mechanize_logger = Logger.new('log/mechanize.log')
mechanize_logger.level = Logger::INFO
url = 'http://google.com'
agent = Mechanize.new
agent.log = mechanize_logger
agent.get(url)
Then check the output of log/mechanize.log in your log directory and you'll see the whole Mechanize process, including the intermediate URLs.
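If you need the intermediate URLs programmatically rather than in a file, here is a sketch of the same approach with the log kept in memory (the regex just matches the "follow redirect to:" line Mechanize logs, as shown in the source quoted above):
require 'mechanize'
require 'logger'
require 'stringio'

# Sketch: keep the Mechanize log in a StringIO and pull out the redirect hops.
buffer = StringIO.new
agent = Mechanize.new
agent.log = Logger.new(buffer)
agent.log.level = Logger::INFO

agent.get('http://google.com')

# Mechanize logs "follow redirect to: <url>" for every hop it takes.
redirects = buffer.string.scan(/follow redirect to: (\S+)/).flatten
puts redirects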
I'm not certain, but here are a couple of things to try:
see what's in m.history[i].uri after the get()
You might need something like:
for m.redirection_limit in 0..99
  begin
    m.get(url)
    break
  rescue WWW::Mechanize::RedirectLimitReachedError
    # code here could get control at
    # intermediate redirection levels
  end
end
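For completeness, here is a sketch of how that loop could collect the intermediate URLs, using the current Mechanize namespace (Mechanize::RedirectLimitReachedError rather than WWW::Mechanize::...). It assumes the error object carries the page it stopped on, as in the raise shown in the source quoted earlier.
require 'mechanize'

# Sketch: raise the redirect limit one step at a time and record where
# Mechanize stopped each time.
url = 'http://google.com'
m = Mechanize.new
hops = []

(0..99).each do |limit|
  m.redirection_limit = limit
  begin
    final = m.get(url)
    hops << final.uri.to_s
    break # no redirect left to follow at this level
  rescue Mechanize::RedirectLimitReachedError => e
    hops << e.page.uri.to_s # the URL reached at this redirection level
  end
end

puts hops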
