I have an open source vulnerability scraper that uses mechanize to search Google. It picks a random search query from a text file to decide what to search for.
I'll post the main file and a link to the Git repository, due to the size of the program.
While it is scraping, every now and then it comes across a 'URL' (I use the term loosely) that looks like this:
[17:05:02 INFO]I'll run in default mode!
[17:05:02 INFO]I'm searching for possible SQL vulnerable sites, using search query inurl:/main.php?f1=
[17:05:04 SUCCESS]Site found: http://forix.autosport.com/main.php?l=0&c=1
[17:05:05 SUCCESS]Site found: https://zweeler.com/formula1/FantasyFormula12016/main.php?ref=103
[17:05:06 SUCCESS]Site found: https://en.zweeler.com/formula1/FantasyFormula1YearGame2015/main.php
[17:05:07 SUCCESS]Site found: http://modelcargo.com/main.php?mod=sambachoose&dep=samba
[17:05:08 SUCCESS]Site found: http://www.ukdirt.co.uk/main.php?P=rules&f=8
[17:05:09 SUCCESS]Site found: http://www.ukdirt.co.uk/main.php?P=tracks&g=2&d=2&m=0
[17:05:11 SUCCESS]Site found: http://zoohoo.sk/redir.php?q=v%FDsledok&url=http%3A%2F%2Flivescore.sk%2Fmain.php%3Flang%3Dsk
[17:05:12 SUCCESS]Site found: http://www.chemical-plus.com/main.php?f1=pearl_pigment.htm
[17:05:13 SUCCESS]Site found: http://www.fantasyf1.co/main.php
[17:05:14 SUCCESS]Site found: http://www.escritores.cl/base.php?f1=escritores/main.php
[17:05:15 SUCCESS]Site found: /settings/ads/preferences?hl=en #<= Right here
When this shows up, it completely crashes the program. I've tried doing the following:
next if urls == '/settings/ads/preferences?hl=en'
next if urls =~ /preferences?hl=en/
next if urls.split('/')[2] == 'ads/preferences?hl=en'
However, it keeps popping up. I should also mention that the last five characters depend on your location; so far I've seen:
hl=en
hl=ru
hl=ia
Does anybody have any idea what this is? I've done some research and can't find anything on it. Any help with this would be fantastic.
Main source:
#!/usr/bin/env ruby
require 'rubygems'
require 'bundler/setup'
require 'mechanize'
require 'nokogiri'
require 'rest-client'
require 'timeout'
require 'uri'
require 'fileutils'
require 'colored'
require 'yaml'
require 'date'
require 'optparse'
require 'tempfile'
require 'socket'
require 'net/http'
require_relative 'lib/modules/format.rb'
require_relative 'lib/modules/credits.rb'
require_relative 'lib/modules/legal.rb'
require_relative 'lib/modules/spider.rb'
require_relative 'lib/modules/copy.rb'
require_relative 'lib/modules/site_info.rb'
include Format
include Credits
include Legal
include Whitewidow
include Copy
include SiteInfo
PATH = Dir.pwd
VERSION = Whitewidow.version
SEARCH = File.readlines("#{PATH}/lib/search_query.txt").sample
info = YAML.load_file("#{PATH}/lib/rand-agents.yaml")
#user_agent = info['user_agents'][info.keys.sample]
OPTIONS = {}
def usage_page
Format.usage("You can run me with the following flags: #{File.basename(__FILE__)} -[d|e|h] -[f] <path/to/file/if/any>")
exit
end
def examples_page
Format.usage('This is my examples page, I\'ll show you a few examples of how to get me to do what you want.')
Format.usage('Running me with a file: whitewidow.rb -f <path/to/file> keep the file inside of one of my directories.')
Format.usage('Running me default, if you don\'t want to use a file, because you don\'t think I can handle it, or for whatever reason, you can run me default by passing the Default flag: whitewidow.rb -d this will allow me to scrape Google for some SQL vuln sites, no guarentees though!')
Format.usage('Running me with my Help flag will show you all options an explanation of what they do and how to use them')
Format.usage('Running me without a flag will show you the usage page. Not descriptive at all but gets the point across')
end
OptionParser.new do |opt|
opt.on('-f FILE', '--file FILE', 'Pass a file name to me, remember to drop the first slash. /tmp/txt.txt <= INCORRECT tmp/text.txt <= CORRECT') { |o| OPTIONS[:file] = o }
opt.on('-d', '--default', 'Run me in default mode, this will allow me to scrape Google using my built in search queries.') { |o| OPTIONS[:default] = o }
opt.on('-e', '--example', 'Shows my example page, gives you some pointers on how this works.') { |o| OPTIONS[:example] = o }
end.parse!
def page(site)
Nokogiri::HTML(RestClient.get(site))
end
def parse(site, tag, i)
parsing = page(site)
parsing.css(tag)[i].to_s
end
def format_file
Format.info('Writing to temporary file..')
if File.exist?(OPTIONS[:file]) # File.exists? is deprecated and removed in newer Rubies
file = Tempfile.new('file')
IO.read(OPTIONS[:file]).each_line do |s|
File.open(file, 'a+') { |format| format.puts(s) unless s.chomp.empty? }
end
IO.read(file).each_line do |file|
File.open("#{PATH}/tmp/#sites.txt", 'a+') { |line| line.puts(file) }
end
file.unlink
Format.info("File: #{OPTIONS[:file]}, has been formatted and saved as #sites.txt in the tmp directory.")
else
puts <<-_END_
Hey now my friend, I know you're eager, I am also, but that file #{OPTIONS[:file]}
either doesn't exist, or it's not in the directory you say it's in..
I'm gonna need you to go find that file, move it to the correct directory and then
run me again.
Don't worry I'll wait!
_END_
.yellow.bold
end
end
def get_urls
Format.info("I'll run in default mode!")
Format.info("I'm searching for possible SQL vulnerable sites, using search query #{SEARCH}")
agent = Mechanize.new
# agent.user_agent = user_agent  # disabled, like the user_agent lookup above
page = agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = "#{SEARCH}"
url = agent.submit(google_form, google_form.buttons.first)
url.links.each do |link|
if link.href.to_s =~ /url.q/
str = link.href.to_s
str_list = str.split(%r{=|&})
urls = str_list[1]
next if urls.split('/')[2].start_with? 'stackoverflow.com', 'github.com', 'www.sa-k.net', 'yoursearch.me', 'search1.speedbit.com', 'duckfm.net', 'search.clearch.org', 'webcache.googleusercontent.com'
next if urls == '/settings/ads/preferences?hl=en' #<= ADD HERE REMEMBER A COMMA =>
urls_to_log = URI.decode(urls)
Format.success("Site found: #{urls_to_log}")
sleep(1)
sql_syntax = ["'", "`", "--", ";"].each do |sql|
File.open("#{PATH}/tmp/SQL_sites_to_check.txt", 'a+') { |s| s.puts("#{urls_to_log}#{sql}") }
end
end
end
Format.info("I've dumped possible vulnerable sites into #{PATH}/tmp/SQL_sites_to_check.txt")
end
def vulnerability_check
case
when OPTIONS[:default]
file_to_read = "tmp/SQL_sites_to_check.txt"
when OPTIONS[:file]
Format.info("Let's check out this file real quick like..")
file_to_read = "tmp/#sites.txt"
end
Format.info('Forcing encoding to UTF-8') unless OPTIONS[:file]
IO.read("#{PATH}/#{file_to_read}").each_line do |vuln|
begin
Format.info("Parsing page for SQL syntax error: #{vuln.chomp}")
Timeout::timeout(10) do
vulns = vuln.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
begin
if parse("#{vulns.chomp}'", 'html', 0)[/You have an error in your SQL syntax/]
Format.site_found(vulns.chomp)
File.open("#{PATH}/tmp/SQL_VULN.txt", "a+") { |s| s.puts(vulns) }
sleep(1)
else
Format.warning("URL: #{vulns.chomp} is not vulnerable, dumped to non_exploitable.txt")
File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vulns) }
sleep(1)
end
rescue Timeout::Error, OpenSSL::SSL::SSLError
Format.warning("URL: #{vulns.chomp} failed to load dumped to non_exploitable.txt")
File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vulns) }
next
sleep(1)
end
end
rescue RestClient::ResourceNotFound, RestClient::InternalServerError, RestClient::RequestTimeout, RestClient::Gone, RestClient::SSLCertificateNotVerified, RestClient::Forbidden, OpenSSL::SSL::SSLError, Errno::ECONNREFUSED, URI::InvalidURIError, Errno::ECONNRESET, Timeout::Error, OpenSSL::SSL::SSLError, Zlib::GzipFile::Error, RestClient::MultipleChoices, RestClient::Unauthorized, SocketError, RestClient::BadRequest, RestClient::ServerBrokeConnection, RestClient::MaxRedirectsReached => e
Format.err("URL: #{vuln.chomp} failed due to an error while connecting, URL dumped to non_exploitable.txt")
File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vuln) }
next
end
end
end
case
when OPTIONS[:default]
begin
Whitewidow.spider
sleep(1)
Credits.credits
sleep(1)
Legal.legal
get_urls
vulnerability_check unless File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
Format.warn("No sites found for search query: #{SEARCH}. Logging into error_log.LOG. Create a issue regarding this.") if File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
File.open("#{PATH}/log/error_log.LOG", 'a+') { |s| s.puts("No sites found with search query #{SEARCH}") } if File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
File.truncate("#{PATH}/tmp/SQL_sites_to_check.txt", 0)
Format.info("I'm truncating SQL_sites_to_check file back to #{File.size("#{PATH}/tmp/SQL_sites_to_check.txt")}")
Copy.file("#{PATH}/tmp/SQL_VULN.txt", "#{PATH}/log/SQL_VULN.LOG")
File.truncate("#{PATH}/tmp/SQL_VULN.txt", 0)
Format.info("I've run all my tests and queries, and logged all important information into #{PATH}/log/SQL_VULN.LOG")
rescue Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError, RestClient::BadGateway => e
d = DateTime.now
Format.fatal("Well this is pretty crappy.. I seem to have encountered a #{e} error. I'm gonna take the safe road and quit scanning before I break something. You can either try again, or manually delete the URL that caused the error.")
File.open("#{PATH}/log/error_log.LOG", 'a+'){ |error| error.puts("[#{d.month}-#{d.day}-#{d.year} :: #{Time.now.strftime("%T")}]#{e}") }
Format.info("I'll log the error inside of #{PATH}/log/error_log.LOG for further analysis.")
end
when OPTIONS[:file]
begin
Whitewidow.spider
sleep(1)
Credits.credits
sleep(1)
Legal.legal
Format.info('Formatting file')
format_file
vulnerability_check
File.truncate("#{PATH}/tmp/SQL_sites_to_check.txt", 0)
Format.info("I'm truncating SQL_sites_to_check file back to #{File.size("#{PATH}/tmp/SQL_sites_to_check.txt")}")
Copy.file("#{PATH}/tmp/SQL_VULN.txt", "#{PATH}/log/SQL_VULN.LOG")
File.truncate("#{PATH}/tmp/SQL_VULN.txt", 0)
Format.info("I've run all my tests and queries, and logged all important information into #{PATH}/log/SQL_VULN.LOG") unless File.size("#{PATH}/log/SQL_VULN.LOG") == 0
rescue Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError, RestClient::BadGateway => e
d = DateTime.now
Format.fatal("Well this is pretty crappy.. I seem to have encountered a #{e} error. I'm gonna take the safe road and quit scanning before I break something. You can either try again, or manually delete the URL that caused the error.")
File.open("#{PATH}/log/error_log.LOG", 'a+'){ |error| error.puts("[#{d.month}-#{d.day}-#{d.year} :: #{Time.now.strftime("%T")}]#{e}") }
Format.info("I'll log the error inside of #{PATH}/log/error_log.LOG for further analysis.")
end
when OPTIONS[:example]
examples_page
else
Format.warning('You failed to pass me a flag!')
usage_page
end
Is there anything in this code that would cause this to randomly pop up? It only happens with some of the random search queries.
Link to GitHub
UPDATE:
I've discovered that the link to Google's advertisement settings has the same path in its URL as the one giving me problems. However, this doesn't explain why I'm getting this link, or why I can't seem to skip over it.
urls = "settings/ads/preferences?hl=ru"
if urls =~ /settings\/ads\/preferences\?hl=[a-z]{2}/
p "I'm skipped"
end
=> "I'm skipped"
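One possible explanation, offered as a sketch rather than a confirmed diagnosis: in get_urls the `next if urls == ...` check runs before `URI.decode`, so it compares against the raw string. The log shows the complete `?hl=en`, which the split on `=` would have cut off unless the `=` arrived percent-encoded. Assuming that is the case (the sample `href` below is hypothetical, and `skip_result?` is a name invented for this example), decoding first and matching any two-letter `hl` value could look like this:

```ruby
require 'uri'

# Hypothetical href, assumed to resemble Google's link to its ads settings page.
href = '/url?q=/settings/ads/preferences%3Fhl%3Den&sa=U'

# Same extraction as get_urls: the piece between the first '=' and the next '&'.
raw = href.split(/=|&/)[1]

# Decode BEFORE filtering, so comparisons see the text that ends up in the log.
decoded = URI.decode_www_form_component(raw)

# Matches the ads settings path for any two-letter language code (en, ru, ia, ...).
ADS_SETTINGS = %r{\A/settings/ads/preferences\?hl=[a-z]{2}\z}

# Skip the ads link and, more generally, anything that is not an absolute http(s) URL.
def skip_result?(url)
  ADS_SETTINGS.match?(url) || !url.start_with?('http://', 'https://')
end

puts "raw:     #{raw}"
puts "decoded: #{decoded}"
puts "skip?    #{skip_result?(decoded)}"
```

Inside the scraper this would mean moving the decode above the `next if` lines and testing the decoded value. Rejecting every result that does not start with http:// or https:// also covers other relative links Google may add later.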
I am writing my first Ruby script and am curious how to actually reference gems in the script. I am unable to test the code beforehand because it reads from an email piped in through /etc/aliases.
Can anyone with experience writing Ruby scripts advise?
P.S. There are many bugs because the script has not been tested or refactored.
Sample Script
#!/usr/bin/env ruby
# Reading files
# Read the mail file, removing leading and trailing spaces from every line
lines = File.readlines(ARGV[0]).map(&:strip)
first_line = lines[1]
# Normalise the number to the 256 country code, with or without a leading "+"
if first_line =~ /^\+?256/
  phone_number = first_line.gsub("+", "")
else
  phone_number = "256#{first_line.gsub(/^0+/,"")}"
end
message = lines[2]
# Sending message
url = "http://xxxxxxxxxxx.com/api/v2/json/messages?token=XXXXXXXXXXXXXXXXXXXXXXXXXXX&to=#{phone_number}&from=XXXXXX&message=#{CGI.escape(message)}"
5.times do |i|
response = HTTParty.get(url)
body = JSON.parse(response.body)
if body["status"] == "Success"
break
end
end
The gems in question are CGI, HTTParty, and JSON.
Using external gems can be done by calling the "require" method.
So to include them in your script, the first few lines could be something like this:
#!/usr/bin/env ruby
require "json"
require "cgi"
require "httparty"
#rest of your code...
I assume you have installed your gems with gem install <gemname>?
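Since the script cannot easily be tested inside the mail pipe, it may help to check from a normal shell that the libraries load at all. The sketch below is only an illustration: `try_require` is a hypothetical helper, not part of Ruby or any gem. json and cgi normally ship with Ruby itself, while httparty has to be installed separately.

```ruby
# Hypothetical helper: tries to load a library and reports whether it worked,
# so a missing gem shows up as a clear message instead of a crash in the pipe.
def try_require(name)
  require name
  true
rescue LoadError => e
  warn "Could not load #{name}: #{e.message}"
  false
end

# Collect the names of the libraries that failed to load.
missing = %w[json cgi httparty].reject { |lib| try_require(lib) }

if missing.empty?
  puts 'All libraries loaded'
else
  puts "Install first: gem install #{missing.join(' ')}"
end
```

Running this once by hand, as the same user the mail system runs the script as, shows whether that user's Ruby can see the installed gems.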
I'm attempting to build a web crawler and ran into a bit of a snag. Basically what I'm doing is extracting the links from a web page and pushing each link to a queue. Whenever the Ruby interpreter hits this section of code:
links.each do |link|
url_frontier.push(link)
end
I receive the following error:
/home/blah/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock': end of file reached (EOFError)
If I comment out the above block of code I get no errors. Please, any help would be appreciated. Here is the rest of the code:
require 'open-uri'
require 'net/http'
require 'uri'
class WebCrawler
def self.Spider(root)
eNDCHARS = %{.,'?!:;}
num_documents = 0
token_list = []
url_repository = Hash.new
url_frontier = Queue.new
url_frontier.push(root.to_s)
while !url_frontier.empty? && num_documents < 10
url = url_frontier.pop
if !url_repository.has_key?(url)
document = open(url)
html = document.read
# extract url's
links = URI.extract(html, ['http']).collect { |u| eNDCHARS.index(u[-1]) ? u.chop : u }
links.each do |link|
url_frontier.push(link)
end
# tokenize
Tokenizer.tokenize(document).each do |word|
token_list.push(IndexStructures::Term.new(word, url))
end
# add to the repository
url_repository[url] = true
num_documents += 1
end
end
# sort by term (primary) and document id (secondary) in reverse to aid in the construction of the inverted index
return num_documents, token_list.sort_by! { |term| [term.term, term.document_id]}.reverse!
end
end
I encountered the same error, but with Watir-webdriver running Firefox in headless mode. I found that if I ran two of my applications in parallel and destroyed "headless" in one of them, the other was killed as well, with the exact error you quoted. Though my situation is not the same as yours, I think the issue is related to a file handle being closed externally while your application is still using it. I removed the destroy command from my application and the error disappeared.
Hope this helps.
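A further observation on the question's backtrace: it points into net/protocol.rb, so the EOFError is raised while reading an HTTP response rather than by `url_frontier.push` itself. Pushing links simply makes the loop go on to fetch more pages. Assuming one of those servers closes the connection early, a crawler can skip that URL instead of crashing. `fetch_or_skip` below is a hypothetical helper written for this sketch, and the block stands in for whatever performs the real request:

```ruby
# Hypothetical helper: runs the given fetch block and returns nil when the
# remote end drops the connection, instead of letting the crawler crash.
def fetch_or_skip(url)
  yield url
rescue EOFError, Errno::ECONNRESET => e
  warn "Skipping #{url}: #{e.class}"
  nil
end

# Simulated fetches: one succeeds, one fails the way the question describes.
good = fetch_or_skip('http://example.com/') { |u| "<html>page at #{u}</html>" }
bad  = fetch_or_skip('http://example.com/broken') { raise EOFError, 'end of file reached' }

puts good.inspect
puts bad.inspect
```

In the crawler loop the block would be something like `{ |u| open(u).read }`, followed by `next if html.nil?` so that a dropped connection only costs one document.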
I would like to use Sinatra's Streaming capability introduced in 1.3 coupled with some stdout redirection. It would basically be a live streaming output of a long running job. I looked into this question and the Sinatra streaming sample in the README.
Running 1.8.7 on OSX:
require 'stringio'
require 'sinatra'
$stdout.sync = true
module Kernel
def capture_stdout
out = StringIO.new
$stdout = out
yield out
ensure
$stdout = STDOUT
end
end
get '/' do
stream do |out|
out << "Part one of a three part series... <br>\n"
sleep 1
out << "...part two... <br>\n"
sleep 1
out << "...and now the conclusion...\n"
Kernel.capture_stdout do |stream|
Thread.new do
until (line = stream.gets).nil? do
out << line
end
end
method_that_prints_text
end
end
end
def method_that_prints_text
puts "starting long running job..."
sleep 3
puts "almost there..."
sleep 3
puts "work complete!"
end
So this bit of code prints out the first three strings properly, and blocks while the method_that_prints_text executes and does not print anything to the browser. My feeling is that stdout is empty on the first call and it never outputs to the out buffer. I'm not quite sure what the proper ordering would be and would appreciate any suggestions.
I tried a few of the EventMachine implementations mentioned in the question above, but couldn't get them to work.
UPDATE
I tried something slightly different to where I had the method run in a new thread, and override STDOUT for that thread as described here...
Instead of Kernel.capture_stdout above...
s = StringIO.new
Thread.start do
Thread.current[:stdout] = s
method_that_prints_text
end.join
while line = s.gets do
out << line
end
out << s.string
With the ThreadOut module listed in the link above, this seems to work a bit better. However it doesn't stream. The only time something is printed to the browser is on the final line out << s.string. Does StringIO not have the capability to stream?
I ended up solving this by discovering that s.string was updated periodically as time went on, so I just captured the output in a separate thread and grabbed the differences and streamed them out. It appears as though string redirection doesn't behave like a normal IO object.
s = StringIO.new
t = Thread.start do
Thread.current[:stdout] = s
method_that_prints_text
sleep 2
end
displayed_text = ''
while t.alive? do
current_text = s.string
unless (current_text.eql?(displayed_text))
new_text = current_text[displayed_text.length..current_text.length]
out << new_text
displayed_text = current_text * 1
end
sleep 2
end
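The same polling idea can be tried outside Sinatra. The sketch below makes a few assumptions: the worker writes straight to the StringIO instead of going through the ThreadOut redirection, `stream_new_output` is a hypothetical helper, and the block plays the role of `out <<`. It tracks an offset rather than comparing whole strings, and checks whether the worker has finished before reading so the last write is not missed.

```ruby
require 'stringio'

# Hypothetical helper: polls a StringIO that another thread is writing to and
# yields only the text appended since the previous poll.
def stream_new_output(buffer, worker, interval = 0.05)
  offset = 0
  loop do
    finished = !worker.alive?        # sample first, so a final write is still read
    snapshot = buffer.string.dup     # copy, so the text cannot grow while we slice it
    if snapshot.length > offset
      yield snapshot[offset..-1]
      offset = snapshot.length
    end
    break if finished
    sleep interval
  end
end

buffer = StringIO.new
worker = Thread.new do
  3.times do |i|
    buffer.puts "step #{i}"
    sleep 0.1
  end
end

chunks = []
stream_new_output(buffer, worker) { |chunk| chunks << chunk }
puts chunks.join
```

This also hints at why the earlier `gets` loop printed nothing: StringIO#gets returns nil as soon as it reaches the current end of the string instead of waiting for more data, so a reader that starts before anything is written stops immediately.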
I'm running a simple Thin server that publishes some messages to different queues. The code looks like:
require "rubygems"
require "thin"
require "amqp"
require 'msgpack'
app = Proc.new do |env|
params = Rack::Request.new(env).params
command = params['command'].strip rescue "no command"
number = params['number'].strip rescue "no number"
p command
p number
AMQP.start do
if command =~ /\A(create|c|r|register)\z/i
MQ.queue("create").publish(number)
elsif m = (/\A(Answer|a)\s?(\d+|\d+-\d+)\z/i.match(command))
MQ.queue("answers").publish({:number => number,:answer => "answer" }.to_msgpack )
end
end
[200, {'Content-Type' => "text/plain"} , [command] ] # a Rack body must respond to each
end
Rack::Handler::Thin.run(app, :Port => 4001)
Now when I run the server and request something like http://0.0.0.0:4001/command=r&number=123123123
I always get duplicate output, something like:
"no command"
"no number"
"no command"
"no number"
First, why am I getting duplicate requests? Does it have something to do with the browser? When I use curl I don't see the same behavior. Second, why can't I get the params?
Any tips about the best implementation for such a server would be highly appreciated.
Thanks in advance.
The second request comes from the browser looking for the favicon.ico. You can inspect the requests by adding the following code in your handler:
params = Rack::Request.new(env).params
p env # add this line to see the request in your console window
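If the extra request does turn out to be for the favicon, the handler can answer it early so the publishing code runs only once per real request. This is a minimal sketch using a bare Rack-style lambda; the response bodies are placeholders rather than the original handler's logic.

```ruby
# A Rack app is any object that responds to call(env), so this runs without gems.
app = lambda do |env|
  if env['PATH_INFO'] == '/favicon.ico'
    # Answer the browser's automatic icon request without running the handler.
    [404, { 'Content-Type' => 'text/plain' }, []]
  else
    [200, { 'Content-Type' => 'text/plain' }, ["handled #{env['PATH_INFO']}"]]
  end
end

# Simulate the two requests a browser would send.
['/', '/favicon.ico'].each do |path|
  status, _headers, body = app.call('PATH_INFO' => path)
  puts "#{path} -> #{status} #{body.inspect}"
end
```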
Alternatively you could use Sinatra:
require "rubygems"
require "amqp"
require "msgpack"
require "sinatra"
get '/:command/:number' do
command = params['command'].strip rescue "no command"
number = params['number'].strip rescue "no number"
p command
p number
AMQP.start do
if command =~ /\A(create|c|r|register)\z/i
MQ.queue("create").publish(number)
elsif m = (/\A(Answer|a)\s?(\d+|\d+-\d+)\z/i.match(command))
MQ.queue("answers").publish({:number => number,:answer => "answer" }.to_msgpack )
end
end
return command
end
and then run ruby the_server.rb at the command line to start the http server.
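On the second part of the question (the missing params), one likely cause is visible in the URL itself: it has no `?`, so `command=r&number=123123123` is part of the path and the query string is empty. The sketch below uses only the standard library's URI to show the difference. `query_params` is a hypothetical helper, and Rack's own parsing may differ in details.

```ruby
require 'uri'

# Hypothetical helper: returns the query-string parameters of a URL as a Hash.
def query_params(url)
  query = URI.parse(url).query.to_s   # nil when the URL has no '?', hence to_s
  URI.decode_www_form(query).to_h
end

# The URL from the question: everything after the slash is path, not query.
p query_params('http://0.0.0.0:4001/command=r&number=123123123')

# The same request with a '?' before the parameters.
p query_params('http://0.0.0.0:4001/?command=r&number=123123123')
```

With the Sinatra route above the parameters are taken from the path instead, so that version would be requested as /r/123123123.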