Crawling a list of URLs and bypassing those with no DNS - Ruby

I am crawling a large list of URLs with Ruby, but not all of the URLs are active or associated with a DNS entry. When my crawler hits one of those URLs it errors out.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'net/http'
require 'colorize'
URL_LIST = [
  'http://website.com',
  'http://website.net'
]
URL_LIST.each do |url|
  item = "#{url}"
  resp = Net::HTTP.get_response(URI.parse(item))
  case resp.code.to_i
  when 200
    puts "Success: #{url}".green
  when 301..303
    new_url = resp['location']
    puts "Redirect #{url} => #{new_url}".yellow
  else
    resp.code
  end
end
When I run this script and hit a bad URL, I receive an error like this:
/Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `initialize': getaddrinfo: nodename nor servname provided, or not known (SocketError)
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `open'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:879:in `block in connect'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:878:in `connect'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:852:in `start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:583:in `start'
from /Users/<name>/.rvm/rubies/ruby-2.1.1/lib/ruby/2.1.0/net/http.rb:478:in `get_response'
from spider.rb:808:in `block in <main>'
from spider.rb:806:in `each'
from spider.rb:806:in `<main>'

Use a begin/rescue block to rescue the error and output error info in red:
URL_LIST = [
  'http://website.com',
  'http://sdfasdfwqeasdfasdfr.com',
  'http://website.net'
]
URL_LIST.each do |url|
  item = "#{url}"
  begin
    resp = Net::HTTP.get_response(URI.parse(item))
    case resp.code.to_i
    when 200
      puts "Success: #{url}".green
    when 301..303
      new_url = resp['location']
      puts "Redirect #{url} => #{new_url}".yellow
    else
      resp.code
    end
  rescue SocketError => e
    puts "Error: #{url} - #{e}".red
  end
end
The output will look like:
Redirect http://website.com => http://www.website.com/
Error: http://sdfasdfwqeasdfasdfr.com - getaddrinfo: nodename nor servname provided, or not known
Success: http://website.net
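If you also want to skip hosts that resolve but then refuse the connection or hang, the same pattern extends to a few more exception classes. A minimal sketch along those lines, reusing URL_LIST and colorize from above (the extra rescue classes and the red "HTTP code" branch are additions, not part of the original script):
require 'net/http'
require 'uri'
require 'colorize'
URL_LIST.each do |url|
  begin
    resp = Net::HTTP.get_response(URI.parse(url))
    case resp.code.to_i
    when 200
      puts "Success: #{url}".green
    when 301..303
      puts "Redirect #{url} => #{resp['location']}".yellow
    else
      puts "HTTP #{resp.code}: #{url}".red
    end
  rescue SocketError, Errno::ECONNREFUSED, Net::OpenTimeout, Net::ReadTimeout => e
    # DNS failure, refused connection, or timeout -- log it and move on
    puts "Error: #{url} - #{e.class}: #{e.message}".red
  end
end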

Related

400 Bad Request for Ruby RSS gem

I can't seem to get this RSS feed to work properly. I've tried Nokogiri and now RSS::Parser, and neither works:
a = 'https://phys.org/rss-feed/biology-news/biology-other/'
URI.open(a) do |rss|
  feed = RSS::Parser.parse(rss)
  puts "Title: #{feed.channel.title}"
  feed.items.each do |item|
    puts "Item: #{item.title}"
  end
end
The code is taken directly out of the docs: https://github.com/ruby/rss
The feed is valid, so I'm confused as to why there's a 400 error code.
What am I doing wrong? Anybody have insight as to how to get this RSS parsed?
Here is the error:
/Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:364:in `open_http': 400 Bad request (OpenURI::HTTPError)
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:741:in `buffer_open'
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:212:in `block in open_loop'
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:210:in `catch'
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:210:in `open_loop'
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:151:in `open_uri'
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/open_uri_redirections-0.2.1/lib/open-uri/redirections_patch.rb:55:in `open_uri'
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:721:in `open'
from /Users/user3/.rbenv/versions/3.1.2/lib/ruby/3.1.0/open-uri.rb:29:in `open'
from /users/user3/app.rb:1856:in `<main>'
The web server requires the request to carry a User-Agent header; without one it returns the 400 error.
require 'uri'
require 'open-uri'
require 'rss'
uri = URI.parse("https://phys.org/rss-feed/biology-news/biology-other/")
uri.open("User-Agent" => "Ruby/#{RUBY_VERSION}") do |rss|
  feed = RSS::Parser.parse(rss)
  puts "Title: #{feed.channel.title}"
  feed.items.each do |item|
    puts "Item: #{item.title}"
  end
end
This code works for me.
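If you also want the script to report exactly which status the server returned (or to survive the occasional malformed feed), the same call can be wrapped in a rescue; OpenURI::HTTPError carries the response via e.io. This is only a sketch building on the answer above, and the RSS::Error branch is an assumption about how you might want to handle parse failures:
require 'open-uri'
require 'rss'
url = "https://phys.org/rss-feed/biology-news/biology-other/"
begin
  URI.open(url, "User-Agent" => "Ruby/#{RUBY_VERSION}") do |rss|
    feed = RSS::Parser.parse(rss)
    puts "Title: #{feed.channel.title}"
  end
rescue OpenURI::HTTPError => e
  # e.io.status is e.g. ["400", "Bad request"]
  puts "HTTP error fetching #{url}: #{e.io.status.join(' ')}"
rescue RSS::Error => e
  puts "Feed fetched but could not be parsed: #{e.message}"
end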

Net::HTTP and Nokogiri - undefined method `body' for nil:NilClass (NoMethodError)

Thanks for your time. I'm somewhat new to OOP and Ruby, and after synthesizing solutions from a few different Stack Overflow answers I've got myself turned around.
My goal is to write a script that parses a CSV of URLs using the Nokogiri library. After trying and failing to use open-uri and the open-uri-redirections plugin to follow redirects, I settled on Net::HTTP, and that got me moving...until I ran into URLs that respond with a 302 redirect specifically.
Here's the method I'm using to engage the URL:
require 'nokogiri'
require 'net/http'
require 'csv'
def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0
  url = URI.parse(uri_str)
  # puts "The value of uri_str is: #{uri_str}"
  # puts "The value of URI.parse(uri_str) is #{url}"
  req = Net::HTTP::Get.new(url.path, { 'User-Agent' => 'Mozilla/5.0 (etc...)' })
  # puts "THE URL IS #{url.scheme + ":" + url.host + url.path}" # just a reporter so I can see if it's mangled
  response = Net::HTTP.start(url.host, url.port, :use_ssl => url.scheme == 'https') { |http| http.request(req) }
  case response
  when Net::HTTPSuccess then response
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else
    # puts "Problem clause!"
    response.error!
  end
end
Further down in my script I take the URL CSV filename from ARGV, do CSV.read, encode each URL to a string, then use Nokogiri::HTML.parse to turn the response into something I can examine with XPath selectors and write out to an output CSV.
It works beautifully...so long as I get a 200 response, which unfortunately is not every website. When I run into a 302 I get this:
C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1570:in `addr_port': undefined method `+' for nil:NilClass (NoMethodError)
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1503:in `begin_transport'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1442:in `transport_request'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1416:in `request'
from httpcsv.rb:14:in `block in fetch'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:877:in `start'
from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:608:in `start'
from httpcsv.rb:14:in `fetch'
from httpcsv.rb:17:in `fetch'
from httpcsv.rb:42:in `block in <main>'
from C:/Ruby24-x64/lib/ruby/2.4.0/csv.rb:866:in `each'
from C:/Ruby24-x64/lib/ruby/2.4.0/csv.rb:866:in `each'
from httpcsv.rb:38:in `<main>'
I know I'm missing something right in front of me, but I can't tell what I should puts to see what is nil. Any help is appreciated, thanks in advance.
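One likely cause, though I can't see the redirecting site: a 302 Location header is allowed to be a relative path (e.g. /some/page), in which case URI.parse(response['location']) has a nil host, and Net::HTTP.start(nil, nil) eventually dies in addr_port with exactly this undefined method `+' for nil:NilClass. A sketch of the fetch method that resolves the Location against the URL that issued the redirect, under that assumption (request_uri is also used instead of path so an empty path becomes "/"):
require 'net/http'
require 'uri'
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0
  url = URI.parse(uri_str)
  req = Net::HTTP::Get.new(url.request_uri, { 'User-Agent' => 'Mozilla/5.0 (etc...)' })
  response = Net::HTTP.start(url.host, url.port, :use_ssl => url.scheme == 'https') { |http| http.request(req) }
  case response
  when Net::HTTPSuccess then response
  when Net::HTTPRedirection
    # Location may be relative; resolve it against the current URL so the
    # next request always has a scheme and a host.
    next_url = URI.join(url, response['location']).to_s
    fetch(next_url, limit - 1)
  else
    response.error!
  end
end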

How to test that an open-uri URL exists before processing any data

I'm trying to process content from a list of links using open-uri in Ruby (1.8.6), but I run into errors when a link is broken or requires authentication:
open-uri.rb:277:in `open_http': 404 Not Found (OpenURI::HTTPError)
from C:/tools/Ruby/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
from C:/tools/Ruby/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
from C:/tools/Ruby/lib/ruby/1.8/open-uri.rb:162:in `catch'
or
C:/tools/Ruby/lib/ruby/1.8/net/http.rb:560:in `initialize': getaddrinfo: no address associated with hostname. (SocketError)
from C:/tools/Ruby/lib/ruby/1.8/net/http.rb:560:in `open'
from C:/tools/Ruby/lib/ruby/1.8/net/http.rb:560:in `connect'
from C:/tools/Ruby/lib/ruby/1.8/timeout.rb:53:in `timeout'
or
C:/tools/Ruby/lib/ruby/1.8/net/protocol.rb:133:in `sysread': An existing connection was forcibly closed by the remote host. (Errno::ECONNRESET)
from C:/tools/Ruby/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'
from C:/tools/Ruby/lib/ruby/1.8/timeout.rb:62:in `timeout'
from C:/tools/Ruby/lib/ruby/1.8/timeout.rb:93:in `timeout'
Is there a way to test the response (URL) before processing any data?
The code is:
require 'open-uri'
smth.css.each do |item|
  open(item[:name], 'wb') do |file|
    file << open(item[:href]).read
  end
end
Many thanks
You could try something along the lines of
require 'open-uri'
smth.css.each do |item|
  begin
    open(item[:name], 'wb') do |file|
      file << open(item[:href]).read
    end
  rescue => e
    case e
    when OpenURI::HTTPError
      # do something
    when SocketError
      # do something else
    when Errno::ECONNRESET
      # do something else again
    else
      raise e
    end
  end
end
I don't know of any way of testing the connection without opening it and trying, so rescuing these errors is the only way I can think of. Note that OpenURI::HTTPError and SocketError are direct subclasses of StandardError, and Errno::ECONNRESET is a subclass of SystemCallError, which is itself a StandardError, so a bare rescue => e catches all three; the case statement just lets you handle each class differently.
I was able to solve this problem by using a conditional to check the parsed response for a "failure" status:
def controller_action
  url = "some_API"
  response = JSON.parse(open(url).read)
  data = response["data"]
  if response["status"] == "failure"
    redirect_to :action => "home"
  else
    do_something_else
  end
end
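If you'd rather keep the download loop clean, the rescue logic from the first answer can be pulled out into a small predicate that you call before processing. This is only a sketch, the method name url_readable? is my own, and it costs an extra request per URL, since there is no way to know whether a URL is reachable without actually trying it:
require 'open-uri'
# Returns true if the URL can be fetched, false otherwise.
def url_readable?(url)
  open(url) { true }
rescue OpenURI::HTTPError, SocketError, Errno::ECONNRESET, Errno::ECONNREFUSED
  false
end
smth.css.each do |item|
  next unless url_readable?(item[:href])
  open(item[:name], 'wb') do |file|
    file << open(item[:href]).read
  end
end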

Ruby Net::HTTP time out

I'm trying to write my first Ruby program, but I have a problem. The code has to download 32 MP3 files over HTTP. It actually downloads a few, then times out.
I tried setting a timeout period, but it makes no difference. Running the code under Windows, Cygwin, and Mac OS X gives the same result.
This is the code:
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'set'
require 'net/http'
require 'uri'
puts "\n Up and running!\n\n"
links_set = {}
pages = ['http://www.vimeo.com/siai/videos/sort:oldest',
         'http://www.vimeo.com/siai/videos/page:2/sort:oldest',
         'http://www.vimeo.com/siai/videos/page:3/sort:oldest']
pages.each do |page|
  doc = Nokogiri::HTML(open(page))
  doc.search('//*[@href]').each do |m|
    video_id = m[:href]
    if video_id.match(/^\/(\d+)$/i)
      links_set[video_id[/\d+/]] = m.children[0].to_s.split(" at ")[0].split(" -- ")[0]
    end
  end
end
links = links_set.to_a
p links
cookie = ''
file_name = ''
open("http://www.tubeminator.com") { |f|
  cookie = f.meta['set-cookie'].split(';')[0]
}
links.each do |link|
  open("http://www.tubeminator.com/ajax.php?function=downloadvideo&url=http%3A%2F%2Fwww.vimeo.com%2F" + link[0],
       "Cookie" => cookie) { |f|
    puts f.read
  }
  open("http://www.tubeminator.com/ajax.php?function=convertvideo&start=0&duration=1120&size=0&format=mp3&vq=high&aq=high",
       "Cookie" => cookie) { |f|
    file_name = f.read
  }
  puts file_name
  Net::HTTP.start("www.tubeminator.com") { |http|
    #http.read_timeout = 3600 # 1 hour
    resp = http.get("/download-video-" + file_name)
    open(link[1] + ".mp3", "wb") { |file|
      file.write(resp.body)
    }
  }
end
puts "\n Yay!!"
And this is the exception:
/Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/protocol.rb:140:in `rescue in rbuf_fill': Timeout::Error (Timeout::Error)
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/protocol.rb:134:in `rbuf_fill'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/protocol.rb:116:in `readuntil'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/protocol.rb:126:in `readline'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/http.rb:2138:in `read_status_line'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/http.rb:2127:in `read_new'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/http.rb:1120:in `transport_request'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/http.rb:1106:in `request'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:312:in `block in open_http'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/net/http.rb:564:in `start'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:306:in `open_http'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:767:in `buffer_open'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:203:in `block in open_loop'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:201:in `catch'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:201:in `open_loop'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:146:in `open_uri'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:669:in `open'
from /Users/test/.rvm/rubies/ruby-1.9.2-preview1/lib/ruby/1.9.1/open-uri.rb:33:in `open'
from test.rb:38:in `block in <main>'
from test.rb:37:in `each'
from test.rb:37:in `<main>'
I'd also appreciate your comments on the rest of the code.
For Ruby 1.8 I used this to solve my timeout issues: reopening the Net::HTTP class in my code and re-initializing it with the default parameters, plus an initialization of my own read_timeout, should keep things sane, I think.
require 'net/http'
# Lengthen the default timeout in Net::HTTP
module Net
  class HTTP
    alias old_initialize initialize
    def initialize(*args)
      old_initialize(*args)
      @read_timeout = 5*60 # 5 minutes
    end
  end
end
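If you'd rather not patch Net::HTTP globally, read_timeout can also be set per connection. Applied to the download part of the script above it would look roughly like this sketch (file_name and link[1] come from the original loop):
require 'net/http'
Net::HTTP.start("www.tubeminator.com") do |http|
  http.read_timeout = 3600 # allow up to an hour per read, on this connection only
  resp = http.get("/download-video-" + file_name)
  open(link[1] + ".mp3", "wb") { |file| file.write(resp.body) }
end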
Your timeout isn't in the code you set the timeout for. It's here, where you use open-uri:
open("http://www.tubeminator.com/ajax.php?function=downloadvideo&url=http%3A%2F%2Fwww.vimeo.com%2F" + link[0],
You can set a read timeout for open-uri like so:
#!/usr/bin/ruby1.9
require 'open-uri'
open('http://stackoverflow.com', 'r', :read_timeout => 0.01) do |http|
  http.read
end
# => /usr/lib/ruby/1.9.0/net/protocol.rb:135:in `sysread': \
# =>   execution expired (Timeout::Error)
# => ...
# =>   from /tmp/foo.rb:5:in `<main>'
:read_timeout is new for Ruby 1.9 (it's not in Ruby 1.8). 0 or nil means "no timeout."
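Applied to the script above, that would mean passing the option to the open-uri calls that actually hang, for example (the 600-second value is an arbitrary choice, and cookie and link come from the original loop):
require 'open-uri'
open("http://www.tubeminator.com/ajax.php?function=downloadvideo&url=http%3A%2F%2Fwww.vimeo.com%2F" + link[0],
     "Cookie" => cookie,
     :read_timeout => 600) do |f| # give the remote conversion up to 10 minutes per read
  puts f.read
end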

Limit to how many errors can be rescued?

I have a program that I'm using as a pentesting tool. I'm in the process of discovering whether websites are vulnerable to SQL injection and came across a Timeout::Error. I have tried to rescue the error, but there are a few other errors that need to be rescued as well. So my question is: is there a limit to how many errors can be rescued within a rescue block? And if not, why is this timeout not getting rescued?
Source:
def get_urls
  info("Searching for possible SQL vulnerable sites.")
  @agent = Mechanize.new
  page = @agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = @agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls = str_list[1]
      next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
      urls_to_log = urls.gsub("%3F", '?').gsub("%3D", '=')
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/SQL_sites_to_check.txt", "a+") { |s| s.puts("#{urls_to_log}'") }
    end
  end
  info("Possible vulnerable sites dumped into #{PATH}/temp/SQL_sites.txt")
end
def check_if_vulnerable
  info("Checking if sites are vulnerable.")
  IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |parse|
    Timeout::timeout(5) do
      begin
        @parsing = Nokogiri::HTML(RestClient.get("#{parse.chomp}"))
      rescue Timeout::Error, RestClient::ResourceNotFound, RestClient::SSLCertificateNotVerified
        if RestClient::ResourceNotFound
          warn("URL: #{parse.chomp} returned 404 error, URL dumped into 404 bin")
          File.open("#{PATH}/lib/404_bin.txt", "a+") { |s| s.puts(parse) }
        elsif RestClient::SSLCertificateNotVerified
          err("URL: #{parse.chomp} requires SSL cert, url dumped into SSL bin")
          File.open("#{PATH}/lib/SSL_bin.txt", "a+") { |s| s.puts(parse) }
        elsif Timeout::Error
          warn("URL: #{parse.chomp} failed to load resulting in time out after 10 seconds. URL dumped into TIMEOUT bin")
          File.open("#{PATH}/lib/TIMEOUT_bin.txt", "a+") { |s| s.puts(parse) }
        end
      end
    end
  end
end
Error:
C:/Ruby22/lib/ruby/2.2.0/net/http.rb:892:in `new': execution expired (Timeout::Error)
from C:/Ruby22/lib/ruby/2.2.0/net/http.rb:892:in `connect'
from C:/Ruby22/lib/ruby/2.2.0/net/http.rb:863:in `do_start'
from C:/Ruby22/lib/ruby/2.2.0/net/http.rb:852:in `start'
from C:/Ruby22/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0-x86-mingw32/lib/restclient/request.rb:413:in `transmit'
from C:/Ruby22/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0-x86-mingw32/lib/restclient/request.rb:176:in `execute'
from C:/Ruby22/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0-x86-mingw32/lib/restclient/request.rb:41:in `execute'
from C:/Ruby22/lib/ruby/gems/2.2.0/gems/rest-client-1.8.0-x86-mingw32/lib/restclient.rb:65:in `get'
from whitewidow.rb:94:in `block (2 levels) in check_if_vulnerable'
from C:/Ruby22/lib/ruby/2.2.0/timeout.rb:88:in `block in timeout'
from C:/Ruby22/lib/ruby/2.2.0/timeout.rb:32:in `block in catch'
from C:/Ruby22/lib/ruby/2.2.0/timeout.rb:32:in `catch'
from C:/Ruby22/lib/ruby/2.2.0/timeout.rb:32:in `catch'
from C:/Ruby22/lib/ruby/2.2.0/timeout.rb:103:in `timeout'
from whitewidow.rb:92:in `block in check_if_vulnerable'
from whitewidow.rb:91:in `each_line'
from whitewidow.rb:91:in `check_if_vulnerable'
from whitewidow.rb:113:in `<main>'
As you can see, in the check_if_vulnerable method I have Timeout::Error rescued. So what is causing this to time out without moving on to the next URL? I've tried adding a next to the rescue but it still doesn't work. Help, please?
Simply moving the Timeout::timeout call inside the begin/rescue lets the error be rescued. Since Ruby 2.1, Timeout delivers the timeout to the block through an internal mechanism and only raises the actual Timeout::Error at the point where Timeout::timeout was called, so a rescue nested inside the timed block never gets a chance to catch it; the begin/rescue has to wrap the Timeout::timeout call itself:
def check_if_vulnerable
  info("Checking if sites are vulnerable.")
  IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |parse|
    begin
      Timeout::timeout(5) do
        @parsing = Nokogiri::HTML(RestClient.get("#{parse.chomp}"))
      end
    rescue Timeout::Error, RestClient::ResourceNotFound, RestClient::SSLCertificateNotVerified
      if RestClient::ResourceNotFound
        warn("URL: #{parse.chomp} returned 404 error, URL dumped into 404 bin")
        File.open("#{PATH}/lib/404_bin.txt", "a+") { |s| s.puts(parse) }
      elsif RestClient::SSLCertificateNotVerified
        err("URL: #{parse.chomp} requires SSL cert, url dumped into SSL bin")
        File.open("#{PATH}/lib/SSL_bin.txt", "a+") { |s| s.puts(parse) }
      elsif Timeout::Error
        warn("URL: #{parse.chomp} failed to load resulting in time out after 10 seconds. URL dumped into TIMEOUT bin")
        File.open("#{PATH}/lib/TIMEOUT_bin.txt", "a+") { |s| s.puts(parse) }
      end
    end
  end
end
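One thing worth noting about both versions: if RestClient::ResourceNotFound tests the class constant itself, which is always truthy, so every rescued exception is logged into the 404 bin no matter what was actually raised. Binding the exception with rescue ... => e and dispatching on it would look something like this sketch (PATH, warn, err, and info are the helpers from the original script):
IO.read("#{PATH}/temp/SQL_sites_to_check.txt").each_line do |parse|
  begin
    Timeout::timeout(5) do
      @parsing = Nokogiri::HTML(RestClient.get(parse.chomp))
    end
  rescue Timeout::Error, RestClient::ResourceNotFound, RestClient::SSLCertificateNotVerified => e
    case e
    when RestClient::ResourceNotFound
      warn("URL: #{parse.chomp} returned 404 error, URL dumped into 404 bin")
      File.open("#{PATH}/lib/404_bin.txt", "a+") { |s| s.puts(parse) }
    when RestClient::SSLCertificateNotVerified
      err("URL: #{parse.chomp} requires SSL cert, URL dumped into SSL bin")
      File.open("#{PATH}/lib/SSL_bin.txt", "a+") { |s| s.puts(parse) }
    when Timeout::Error
      warn("URL: #{parse.chomp} timed out after 5 seconds, URL dumped into TIMEOUT bin")
      File.open("#{PATH}/lib/TIMEOUT_bin.txt", "a+") { |s| s.puts(parse) }
    end
  end
end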
