Using Mechanize to validate a set of URLs - Ruby

I'm trying to validate an array of URLs using Mechanize. One of the URLs returns a 404, which ends my loop instead of falling through to the rescue. I want the loop to continue even if it hits a 404. Am I doing something wrong with the begin/rescue syntax? I'm just displaying the results in the terminal for the time being.
a.get(url) do |page|
  begin
    puts url
    puts page.title
  rescue Mechanize::ResponseCodeError, Net::HTTPNotFound
    puts "404!- " + "#{url}"
    next
  end
end

You need your begin/rescue/end around a.get, i.e.:
begin
  a.get(url) do |page|
    puts url
    puts page.title
  end
rescue Mechanize::ResponseCodeError, Net::HTTPNotFound
  puts "404!- " + "#{url}"
  next
end
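For reference, a minimal sketch of the whole loop, assuming urls is the array being validated and a is a Mechanize instance (Mechanize raises Mechanize::ResponseCodeError for non-successful responses):

require 'mechanize'

a = Mechanize.new
urls.each do |url|
  begin
    page = a.get(url)
    puts url
    puts page.title
  rescue Mechanize::ResponseCodeError => e
    # response_code is the HTTP status as a string, e.g. "404"
    puts "#{e.response_code}! - #{url}"
  end
end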

Invalid next compile error

I have a method that scans a list of URLs and checks whether each page contains an SQL error:
def begin_vulnerability_check
  info("Checking if sites are vulnerable.")
  IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |parse|
    Timeout::timeout(10) do
      parsing = Nokogiri::HTML(RestClient.get("#{parse.chomp}"))
      info("Parsing page for SQL syntax error: #{parse.chomp}")
      if parsing.css('html')[0].to_s[/You have an error in your SQL syntax/]
        successful = parse
        success("URL: #{parse.chomp} returned SQL syntax error, dumped to SQL_VULN.txt")
        File.open("#{PATH}/lib/SQL_VULN.txt", "a+") { |s| s.puts(parse) }
        sleep(1)
      else
        err("URL: #{parse.chomp} returned an error, dumped to non_exploitable.txt")
        File.open("#{PATH}/lib/non_exploitable.txt", "a+") { |s| s.puts(parse) }
        sleep(1)
      end
    end
  end
end
During testing I'm scanning through this list of URLs:
http://www.bible.com/subcat.php?id=2'
http://www.cidko.com/pro_con.php?id=3'
http://www.slavsandtars.com/about.php?id=25'
http://www.police.gov/content.php?id=275'
http://www.icdprague.org/index.php?id=10'
http://huawei.com/en/plugin.php?id=hwdownload'
https://huawei.com/en/plugin.php?id=unlock'
https://facebook.com/profile.php?id'
http://www.footballclub.com.au/index.php?id=43'
http://www.mesrs.gouv/index.php?id=1525'
I also have a rescue block that is supposed to catch the Timeout::Error exception and move on to the next URL in the list:
begin
  begin_vulnerability_check
rescue Timeout::Error
  if Timeout::Error
    warn("Page timed out, this is usually cause by the page returning a white page, or being non-existent, skipping.")
    next
  end
end
However, when attempting to run this program, I get the following error:
whitewidow.rb:130: Invalid next
whitewidow.rb: compile error (SyntaxError)
Line 130:
rescue Timeout::Error
  if Timeout::Error
    warn("Page timed out, this is usually cause by the page returning a white page, or being non-existent, skipping.")
    next #<= HERE
  end
end
My question is: am I using next in the wrong sense? It seemed to me that next means "if this happens, go to the next line"; am I wrong for thinking that? How can I refactor this to work?
You can use next to return from a block. You cannot use it outside a block like you're trying to do.
But you don't even need next, because when you rescue the timeout error the iteration will automatically continue with the next line. You just have to move the rescue inside the each_line iteration.
Your code should be something like this:
def begin_vulnerability_check
  IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |parse|
    begin
      Timeout::timeout(10) do
        ...
      end
    rescue Timeout::Error
      # Will automatically continue with next line after this
    end
  end
end
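Filling the body back in, a sketch of the refactored method (assuming the same PATH constant and the info/success/err/warn helpers from the question):

def begin_vulnerability_check
  info("Checking if sites are vulnerable.")
  IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |parse|
    begin
      Timeout::timeout(10) do
        parsing = Nokogiri::HTML(RestClient.get(parse.chomp))
        info("Parsing page for SQL syntax error: #{parse.chomp}")
        if parsing.css('html')[0].to_s[/You have an error in your SQL syntax/]
          success("URL: #{parse.chomp} returned SQL syntax error, dumped to SQL_VULN.txt")
          File.open("#{PATH}/lib/SQL_VULN.txt", "a+") { |s| s.puts(parse) }
        else
          err("URL: #{parse.chomp} returned an error, dumped to non_exploitable.txt")
          File.open("#{PATH}/lib/non_exploitable.txt", "a+") { |s| s.puts(parse) }
        end
        sleep(1)
      end
    rescue Timeout::Error
      warn("Page timed out, skipping.")
      # each_line simply continues with the next URL
    end
  end
end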

Ruby - Getting page content even if it doesn't exist

I am trying to put together a series of custom 404 pages.
require 'uri'

def open(url)
  page_content = Net::HTTP.get(URI.parse(url))
  puts page_content.content
end

open('http://somesite.com/1ygjah1761')
The code above exits the program with an error. How can I get the page content from a website, regardless of whether it returns a 404 or not?
You need to rescue from the error:
def open(url)
  require 'net/http'
  page_content = ""
  begin
    page_content = Net::HTTP.get(URI.parse(url))
    puts page_content
  rescue Net::HTTPNotFound
    puts "THIS IS 404" + page_content
  end
end
You can find more information on something like this here: http://tammersaleh.com/posts/rescuing-net-http-exceptions/
Net::HTTP.get returns the page content directly as a string, so there is no need to call .content on the result:
page_content = Net::HTTP.get(URI.parse(url))
puts page_content
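If you would rather branch on the status code explicitly, Net::HTTP.get_response returns a response object whose class reflects the status (Net::HTTPNotFound for a 404). A sketch:

require 'net/http'
require 'uri'

def fetch(url)
  response = Net::HTTP.get_response(URI.parse(url))
  if response.is_a?(Net::HTTPNotFound)
    puts "THIS IS 404: #{response.body}"   # a custom 404 page still has a body
  else
    puts response.body
  end
end

fetch('http://somesite.com/1ygjah1761')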

How can I use EventMachine from within a Sinatra app?

I use an API that is written on top of EventMachine. This means that to make a call, I need to write something like the following:
EventMachine.run do
  api.query do |result|
    # Do stuff with result
  end
  EventMachine.stop
end
Works fine.
But now I want to use this same API within a Sinatra controller. I tried this:
get "/foo" do
output = ""
EventMachine.run do
api.query do |result|
output = "Result: #{result}"
end
EventMachine.stop
end
output
end
But this doesn't work. The run block is bypassed, so an empty response is returned and once stop is called, Sinatra shuts down.
Not sure if it's relevant, but my Sinatra app runs on Thin.
What am I doing wrong?
I've found a workaround: busy-waiting until data becomes available. Possibly not the best solution, but it works at least:
helpers do
  def wait_for(&block)
    while (return_val = block.call).nil?
      sleep(0.1)
    end
    return_val
  end
end
get "/foo" do
output = nil
EventMachine.run do
api.query do |result|
output = "Result: #{result}"
end
end
wait_for { output }
end
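One risk with busy-waiting like this is spinning forever if the query never yields a result. A small guard can cap the wait (a sketch, with the limit chosen arbitrarily):

helpers do
  # Same idea as wait_for above, but give up after roughly `limit` seconds
  def wait_for(limit = 10, &block)
    started = Time.now
    while (return_val = block.call).nil?
      raise "gave up waiting after #{limit} seconds" if Time.now - started > limit
      sleep(0.1)
    end
    return_val
  end
end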

Getting an error when looping

I'm writing some code that pulls URLs from a text file and then checks whether they load or not. The code I have is:
require 'rubygems'
require 'watir'
require 'timeout'

Watir::Browser.default = "firefox"
browser = Watir::Browser.new

File.open('pl.txt').each_line do |urls|
  begin
    Timeout::timeout(10) do
      browser.goto(urls.chomp)
      if browser.text.include? "server"
        puts 'here the page didnt'
      else
        puts 'here site was found'
        File.open('works.txt', 'a') { |f| f.puts urls }
      end
    end
  rescue Timeout::Error => e
    puts e
  end
end
browser.close
The problem is that I get the following error:
execution expired
/Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/jssh_socket.rb:19:in `const_get': wrong number of arguments (2 for 1) (ArgumentError)
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/jssh_socket.rb:19:in `js_eval'
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/firefox.rb:303:in `open_window'
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/firefox.rb:94:in `get_window_number'
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/firefox.rb:103:in `goto'
from samplecodestack.rb:17
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/timeout.rb:62:in `timeout'
from samplecodestack.rb:16
from samplecodestack.rb:13:in `each_line'
from samplecodestack.rb:13
Anyone know how to get it working?
You can use net/http and handle the timeouts too.
require "net/http"
require "uri"
File.open('pl.txt').each_line do |urls|
uri = URI.parse(urls.chomp)
begin
response = Net::HTTP.get_response(uri)
rescue Exception=> e
puts e.message
puts "did not load!"
end
end
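If you want the timeouts to be explicit instead of relying on Net::HTTP's defaults, on a modern Ruby you can set them per connection. A sketch (the 5 and 10 second values are arbitrary):

require "net/http"
require "uri"

File.open('pl.txt').each_line do |line|
  uri = URI.parse(line.chomp)
  begin
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == 'https',
                    open_timeout: 5, read_timeout: 10) do |http|
      response = http.get(uri.request_uri)
      puts "#{uri} -> #{response.code}"
    end
  rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e
    puts "#{uri} did not load: #{e.message}"
  end
end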
I had trouble following your stack trace but it seems to be on your goto statement.
execution expired is the error that occurs when the block given to Timeout::timeout takes longer than the specified time. Note that the timeout checks that its entire block completes in that time. Given the line numbers in the stack trace, I am guessing that the URL being loaded took close to 10 seconds and then the text check timed out.
I assume you really only mean for the timeout to occur if the page takes longer than 10 seconds to load, rather than the entire test taking 10 seconds to finish. So you should move the if statement out of the Timeout block:
File.open('pl.txt').each_line do |urls|
  begin
    Timeout::timeout(10) do
      browser.goto(urls.chomp)
    end
    if browser.text.include? "server"
      puts 'here the page didnt'
    else
      puts 'here site was found'
      File.open('works.txt', 'a') { |f| f.puts urls }
    end
  rescue Timeout::Error => e
    puts 'here the page took too long to load'
    puts e
  end
end

Is there a way to flush HTML to the wire in Sinatra?

I have a Sinatra app with a long-running process (a web scraper). I'd like the app to flush the crawler's progress as the crawler is running, instead of only at the end.
I've considered forking the request and doing something fancy with Ajax, but this is a really basic one-page app that just needs to output a log to the browser as it happens. Any suggestions?
Update (2012-03-21)
As of Sinatra 1.3.0, you can use the new streaming API:
get '/' do
  stream do |out|
    out << "foo\n"
    sleep 10
    out << "bar\n"
  end
end
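Applied to the original question, the crawler's log could be pushed out as it is produced, along these lines (a sketch, where crawl_pages is a hypothetical stand-in for the actual scraper, assumed to yield one progress message per step):

require 'sinatra'

get '/crawl' do
  stream do |out|
    # crawl_pages is a hypothetical helper that yields progress messages
    crawl_pages do |message|
      out << "#{message}\n"
    end
    out << "done\n"
  end
end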
Old Answer
Unfortunately you don't have a stream you can simply flush to (that would not work with Rack middleware). The result returned from a route block simply has to respond to each. The Rack handler will then call each with a block and, in that block, flush the given part of the body to the client.
All Rack responses always have to respond to each and always hand strings to the given block. Sinatra takes care of this for you if you just return a string.
A simple streaming example would be:
require 'sinatra'
get '/' do
result = ["this", " takes", " some", " time"]
class << result
def each
super do |str|
yield str
sleep 0.3
end
end
end
result
end
Now you could simply place all your crawling in the each method:
require 'sinatra'
require 'open-uri'

class Crawler
  def initialize(url)
    @url = url
  end

  def each
    yield "opening url\n"
    result = open(@url).read
    yield "searching for foo\n"
    if result.include? "foo"
      yield "found it\n"
    else
      yield "not there, sorry\n"
    end
  end
end

get '/' do
  Crawler.new 'http://mysite'
end
