Download a file only if it exists with Ruby

I'm doing a scraper to download all the issues of The Exile available at http://exile.ru/archive/list.php?IBLOCK_ID=35&PARAMS=ISSUE.
So far, my code is like this:
require 'rubygems'
require 'open-uri'

DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)

BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"

for number in 120..290
  numero = BASE_exile_URL + number.to_s + ".pdf"
  puts "Downloading issue #{number}"
  open(numero) { |f|
    File.open("#{DATA_DIR}/#{number}.pdf", 'w') do |file|
      file.puts f.read
    end
  }
end

puts "done"
The thing is, a lot of the issue links are down, and the code still creates a PDF for every issue, so a dead link leaves behind an empty PDF. How can I change the code so that it only creates and writes a file when the link actually exists?

require 'open-uri'

DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)

url_template      = "http://exile.ru/docs/pdf/issues/exile%d.pdf"
filename_template = "#{DATA_DIR}/%d.pdf"

(120..290).each do |number|
  pdf_url = url_template % number
  print "Downloading issue #{number}"

  # Opening the URL downloads the remote file.
  open(pdf_url) do |pdf_in|
    if pdf_in.read(4) == '%PDF'
      pdf_in.rewind
      # 'wb' keeps the PDF bytes intact on every platform
      File.open(filename_template % number, 'wb') do |pdf_out|
        pdf_out.write(pdf_in.read)
      end
      print " OK\n"
    else
      print " #{pdf_url} is not a PDF\n"
    end
  end
end

puts "done"
open(url) downloads the file and provides a handle to a local temp file. A PDF starts with '%PDF'. After reading the first 4 characters, if the file is a PDF, the file pointer has to be put back to the beginning to capture the whole file when writing a local copy.
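If the server returned a proper HTTP error for a missing issue, you wouldn't even need to inspect the body: open-uri raises OpenURI::HTTPError (e.g. for a 404), which you can rescue and skip. This particular server apparently answers 200 with an error page instead (see the other answers), so take this only as a sketch of the general approach:

require 'open-uri'

DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exist?(DATA_DIR)

(120..290).each do |number|
  pdf_url = "http://exile.ru/docs/pdf/issues/exile#{number}.pdf"
  begin
    open(pdf_url) do |pdf_in|
      # 'wb' keeps the PDF bytes intact
      File.open("#{DATA_DIR}/#{number}.pdf", 'wb') { |pdf_out| pdf_out.write(pdf_in.read) }
    end
    puts "Issue #{number} saved"
  rescue OpenURI::HTTPError => e
    # e.message is the status line, e.g. "404 Not Found"
    puts "Issue #{number} skipped (#{e.message})"
  end
end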

You can use this code to check whether the PDF exists before downloading it:
require 'net/http'

def exist_the_pdf?(url_pdf)
  url = URI.parse(url_pdf)
  Net::HTTP.start(url.host, url.port) do |http|
    # A HEAD request fetches only the headers, so nothing is downloaded;
    # the block's value (true/false) becomes the method's return value
    http.request_head(url.path)['content-type'] == 'application/pdf'
  end
end
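Because the method returns the result of the comparison, you can use it as a guard in the download loop. A rough sketch, reusing DATA_DIR and open-uri from the question's code:

require 'open-uri'
require 'net/http'

(120..290).each do |number|
  pdf_url = "http://exile.ru/docs/pdf/issues/exile#{number}.pdf"
  next unless exist_the_pdf?(pdf_url)            # HEAD request only, nothing downloaded yet
  File.open("#{DATA_DIR}/#{number}.pdf", 'wb') do |file|
    file.write(open(pdf_url).read)               # now fetch the real PDF
  end
end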

Try this:
require 'rubygems'
require 'open-uri'

DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)

BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"

for number in 120..290
  numero = BASE_exile_URL + number.to_s + ".pdf"
  open(numero) { |f|
    content = f.read
    if content.include? "Link is missing"
      puts "Issue #{number} doesn't exist"
    else
      puts "Issue #{number} exists"
      File.open("./#{number}.pdf", 'wb') do |file|
        file.write(content)
      end
    end
  }
end

puts "done"
The main thing I added is a check for the string "Link is missing" in the response. I wanted to do it using HTTP status codes, but the server always gives back a 200, which is not best practice on its part.
The thing to note is that with my code you always download the whole response just to look for that string, but I don't have a better idea to avoid that at the moment.
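One middle ground, since the status code is useless here: open-uri exposes the Content-Type of the response on the handle it yields (f.content_type), so you can key off that instead of searching the body for a particular phrase. It still downloads the whole response, but it no longer depends on the wording of the error page. A sketch, assuming the server labels real issues as application/pdf:

open(numero) { |f|
  if f.content_type == 'application/pdf'
    puts "Issue #{number} exists"
    File.open("./#{number}.pdf", 'wb') do |file|
      file.write(f.read)
    end
  else
    puts "Issue #{number} doesn't exist (got #{f.content_type})"
  end
}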

Related

404 not found. CSV and Ruby

I'm trying to convert links into images in different formats (jpg, pdf and so on). I tried it earlier today and it worked fine until the last 500 links, because my internet had a hiccup. So I removed all the already-converted links and was going to go at it again, but this time nothing works. The program runs but can't seem to download the images, so I get the error "404 not found".
require 'open-uri'
require 'tempfile'
require 'uri'
require 'csv'

DOWNLOAD_DIR = "#{Dir.pwd}/BD/"
CSV_FILE = "#{Dir.pwd}/konvertera.csv"

def downloadFile(id, url, format)
  open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb+") do |file|
    file << open(url).read
    puts "Successfully downloaded #{url} to #{DOWNLOAD_DIR}#{id}.#{format}"
  end
rescue
  puts "404 not found #{url}"
end

CSV.foreach(CSV_FILE, headers: true, col_sep: ";") do |row|
  puts row[0], row[1]
  next unless row[0] && row[1]

  id = row[0]
  format = row[1].match(/BD\.(.+)$/)&.captures.first
  puts format

  url = row[1].gsub ".pdf", ""
  downloadFile(id, url, format)
end
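One thing worth pointing out in the code above: the bare rescue reports every failure as "404 not found", even when the real problem is something else (a malformed URL, a timeout, a nil format when the regex doesn't match). A hedged variation of downloadFile that surfaces the actual error makes the diagnosis much easier:

def downloadFile(id, url, format)
  open("#{DOWNLOAD_DIR}#{id}.#{format}", "wb+") do |file|
    file << open(url).read
    puts "Successfully downloaded #{url} to #{DOWNLOAD_DIR}#{id}.#{format}"
  end
rescue OpenURI::HTTPError => e
  puts "HTTP error for #{url}: #{e.message}"      # e.g. "404 Not Found"
rescue StandardError => e
  puts "Failed for #{url}: #{e.class}: #{e.message}"
end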

/settings/ads/ Keeps popping up while scraping Google

I have a program that scrapes Google; it's an open-source vulnerability scraper that uses Mechanize to search Google. It picks a random search query from a text file to decide what to search for.
I'll post the main file and a link to the Git repo due to the size of the program.
Anyway, while it is scraping, every now and then it comes across a 'URL' (I use the term loosely) that looks like this:
[17:05:02 INFO]I'll run in default mode!
[17:05:02 INFO]I'm searching for possible SQL vulnerable sites, using search query inurl:/main.php?f1=
[17:05:04 SUCCESS]Site found: http://forix.autosport.com/main.php?l=0&c=1
[17:05:05 SUCCESS]Site found: https://zweeler.com/formula1/FantasyFormula12016/main.php?ref=103
[17:05:06 SUCCESS]Site found: https://en.zweeler.com/formula1/FantasyFormula1YearGame2015/main.php
[17:05:07 SUCCESS]Site found: http://modelcargo.com/main.php?mod=sambachoose&dep=samba
[17:05:08 SUCCESS]Site found: http://www.ukdirt.co.uk/main.php?P=rules&f=8
[17:05:09 SUCCESS]Site found: http://www.ukdirt.co.uk/main.php?P=tracks&g=2&d=2&m=0
[17:05:11 SUCCESS]Site found: http://zoohoo.sk/redir.php?q=v%FDsledok&url=http%3A%2F%2Flivescore.sk%2Fmain.php%3Flang%3Dsk
[17:05:12 SUCCESS]Site found: http://www.chemical-plus.com/main.php?f1=pearl_pigment.htm
[17:05:13 SUCCESS]Site found: http://www.fantasyf1.co/main.php
[17:05:14 SUCCESS]Site found: http://www.escritores.cl/base.php?f1=escritores/main.php
[17:05:15 SUCCESS]Site found: /settings/ads/preferences?hl=en #<= Right here
When this shows up, it completely crashes the program. I've tried doing the following:
next if urls == '/settings/ads/preferences?hl=en'
next if urls =~ /preferences?hl=en/
next if urls.split('/')[2] == 'ads/preferences?hl=en'
However, it keeps popping up. I should also mention that the last five characters depend on your location; so far I've seen:
hl=en
hl=ru
hl=ia
Does anybody have any idea what this is? I've done some research and literally can't find anything on it. Any help with this would be fantastic.
Main source:
#!/usr/local/env ruby
require 'rubygems'
require 'bundler/setup'
require 'mechanize'
require 'nokogiri'
require 'rest-client'
require 'timeout'
require 'uri'
require 'fileutils'
require 'colored'
require 'yaml'
require 'date'
require 'optparse'
require 'tempfile'
require 'socket'
require 'net/http'
require_relative 'lib/modules/format.rb'
require_relative 'lib/modules/credits.rb'
require_relative 'lib/modules/legal.rb'
require_relative 'lib/modules/spider.rb'
require_relative 'lib/modules/copy.rb'
require_relative 'lib/modules/site_info.rb'
include Format
include Credits
include Legal
include Whitewidow
include Copy
include SiteInfo
PATH = Dir.pwd
VERSION = Whitewidow.version
SEARCH = File.readlines("#{PATH}/lib/search_query.txt").sample
info = YAML.load_file("#{PATH}/lib/rand-agents.yaml")
#user_agent = info['user_agents'][info.keys.sample]
OPTIONS = {}
def usage_page
  Format.usage("You can run me with the following flags: #{File.basename(__FILE__)} -[d|e|h] -[f] <path/to/file/if/any>")
  exit
end

def examples_page
  Format.usage('This is my examples page, I\'ll show you a few examples of how to get me to do what you want.')
  Format.usage('Running me with a file: whitewidow.rb -f <path/to/file> keep the file inside of one of my directories.')
  Format.usage('Running me default, if you don\'t want to use a file, because you don\'t think I can handle it, or for whatever reason, you can run me default by passing the Default flag: whitewidow.rb -d this will allow me to scrape Google for some SQL vuln sites, no guarentees though!')
  Format.usage('Running me with my Help flag will show you all options an explanation of what they do and how to use them')
  Format.usage('Running me without a flag will show you the usage page. Not descriptive at all but gets the point across')
end

OptionParser.new do |opt|
  opt.on('-f FILE', '--file FILE', 'Pass a file name to me, remember to drop the first slash. /tmp/txt.txt <= INCORRECT tmp/text.txt <= CORRECT') { |o| OPTIONS[:file] = o }
  opt.on('-d', '--default', 'Run me in default mode, this will allow me to scrape Google using my built in search queries.') { |o| OPTIONS[:default] = o }
  opt.on('-e', '--example', 'Shows my example page, gives you some pointers on how this works.') { |o| OPTIONS[:example] = o }
end.parse!
def page(site)
  Nokogiri::HTML(RestClient.get(site))
end

def parse(site, tag, i)
  parsing = page(site)
  parsing.css(tag)[i].to_s
end

def format_file
  Format.info('Writing to temporary file..')
  if File.exists?(OPTIONS[:file])
    file = Tempfile.new('file')
    IO.read(OPTIONS[:file]).each_line do |s|
      File.open(file, 'a+') { |format| format.puts(s) unless s.chomp.empty? }
    end
    IO.read(file).each_line do |file|
      File.open("#{PATH}/tmp/#sites.txt", 'a+') { |line| line.puts(file) }
    end
    file.unlink
    Format.info("File: #{OPTIONS[:file]}, has been formatted and saved as #sites.txt in the tmp directory.")
  else
    puts <<-_END_
Hey now my friend, I know you're eager, I am also, but that file #{OPTIONS[:file]}
either doesn't exist, or it's not in the directory you say it's in..
I'm gonna need you to go find that file, move it to the correct directory and then
run me again.
Don't worry I'll wait!
_END_
.yellow.bold
  end
end
def get_urls
  Format.info("I'll run in default mode!")
  Format.info("I'm searching for possible SQL vulnerable sites, using search query #{SEARCH}")
  agent = Mechanize.new
  agent.user_agent = #user_agent
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls = str_list[1]
      next if urls.split('/')[2].start_with? 'stackoverflow.com', 'github.com', 'www.sa-k.net', 'yoursearch.me', 'search1.speedbit.com', 'duckfm.net', 'search.clearch.org', 'webcache.googleusercontent.com'
      next if urls == '/settings/ads/preferences?hl=en' #<= ADD HERE REMEMBER A COMMA =>
      urls_to_log = URI.decode(urls)
      Format.success("Site found: #{urls_to_log}")
      sleep(1)
      sql_syntax = ["'", "`", "--", ";"].each do |sql|
        File.open("#{PATH}/tmp/SQL_sites_to_check.txt", 'a+') { |s| s.puts("#{urls_to_log}#{sql}") }
      end
    end
  end
  Format.info("I've dumped possible vulnerable sites into #{PATH}/tmp/SQL_sites_to_check.txt")
end
def vulnerability_check
  case
  when OPTIONS[:default]
    file_to_read = "tmp/SQL_sites_to_check.txt"
  when OPTIONS[:file]
    Format.info("Let's check out this file real quick like..")
    file_to_read = "tmp/#sites.txt"
  end
  Format.info('Forcing encoding to UTF-8') unless OPTIONS[:file]
  IO.read("#{PATH}/#{file_to_read}").each_line do |vuln|
    begin
      Format.info("Parsing page for SQL syntax error: #{vuln.chomp}")
      Timeout::timeout(10) do
        vulns = vuln.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
        begin
          if parse("#{vulns.chomp}'", 'html', 0)[/You have an error in your SQL syntax/]
            Format.site_found(vulns.chomp)
            File.open("#{PATH}/tmp/SQL_VULN.txt", "a+") { |s| s.puts(vulns) }
            sleep(1)
          else
            Format.warning("URL: #{vulns.chomp} is not vulnerable, dumped to non_exploitable.txt")
            File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vulns) }
            sleep(1)
          end
        rescue Timeout::Error, OpenSSL::SSL::SSLError
          Format.warning("URL: #{vulns.chomp} failed to load dumped to non_exploitable.txt")
          File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vulns) }
          next
          sleep(1)
        end
      end
    rescue RestClient::ResourceNotFound, RestClient::InternalServerError, RestClient::RequestTimeout, RestClient::Gone, RestClient::SSLCertificateNotVerified, RestClient::Forbidden, OpenSSL::SSL::SSLError, Errno::ECONNREFUSED, URI::InvalidURIError, Errno::ECONNRESET, Timeout::Error, OpenSSL::SSL::SSLError, Zlib::GzipFile::Error, RestClient::MultipleChoices, RestClient::Unauthorized, SocketError, RestClient::BadRequest, RestClient::ServerBrokeConnection, RestClient::MaxRedirectsReached => e
      Format.err("URL: #{vuln.chomp} failed due to an error while connecting, URL dumped to non_exploitable.txt")
      File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vuln) }
      next
    end
  end
end
case
when OPTIONS[:default]
  begin
    Whitewidow.spider
    sleep(1)
    Credits.credits
    sleep(1)
    Legal.legal
    get_urls
    vulnerability_check unless File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
    Format.warn("No sites found for search query: #{SEARCH}. Logging into error_log.LOG. Create a issue regarding this.") if File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
    File.open("#{PATH}/log/error_log.LOG", 'a+') { |s| s.puts("No sites found with search query #{SEARCH}") } if File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
    File.truncate("#{PATH}/tmp/SQL_sites_to_check.txt", 0)
    Format.info("I'm truncating SQL_sites_to_check file back to #{File.size("#{PATH}/tmp/SQL_sites_to_check.txt")}")
    Copy.file("#{PATH}/tmp/SQL_VULN.txt", "#{PATH}/log/SQL_VULN.LOG")
    File.truncate("#{PATH}/tmp/SQL_VULN.txt", 0)
    Format.info("I've run all my tests and queries, and logged all important information into #{PATH}/log/SQL_VULN.LOG")
  rescue Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError, RestClient::BadGateway => e
    d = DateTime.now
    Format.fatal("Well this is pretty crappy.. I seem to have encountered a #{e} error. I'm gonna take the safe road and quit scanning before I break something. You can either try again, or manually delete the URL that caused the error.")
    File.open("#{PATH}/log/error_log.LOG", 'a+'){ |error| error.puts("[#{d.month}-#{d.day}-#{d.year} :: #{Time.now.strftime("%T")}]#{e}") }
    Format.info("I'll log the error inside of #{PATH}/log/error_log.LOG for further analysis.")
  end
when OPTIONS[:file]
  begin
    Whitewidow.spider
    sleep(1)
    Credits.credits
    sleep(1)
    Legal.legal
    Format.info('Formatting file')
    format_file
    vulnerability_check
    File.truncate("#{PATH}/tmp/SQL_sites_to_check.txt", 0)
    Format.info("I'm truncating SQL_sites_to_check file back to #{File.size("#{PATH}/tmp/SQL_sites_to_check.txt")}")
    Copy.file("#{PATH}/tmp/SQL_VULN.txt", "#{PATH}/log/SQL_VULN.LOG")
    File.truncate("#{PATH}/tmp/SQL_VULN.txt", 0)
    Format.info("I've run all my tests and queries, and logged all important information into #{PATH}/log/SQL_VULN.LOG") unless File.size("#{PATH}/log/SQL_VULN.LOG") == 0
  rescue Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError, RestClient::BadGateway => e
    d = DateTime.now
    Format.fatal("Well this is pretty crappy.. I seem to have encountered a #{e} error. I'm gonna take the safe road and quit scanning before I break something. You can either try again, or manually delete the URL that caused the error.")
    File.open("#{PATH}/log/error_log.LOG", 'a+'){ |error| error.puts("[#{d.month}-#{d.day}-#{d.year} :: #{Time.now.strftime("%T")}]#{e}") }
    Format.info("I'll log the error inside of #{PATH}/log/error_log.LOG for further analysis.")
  end
when OPTIONS[:example]
  examples_page
else
  Format.warning('You failed to pass me a flag!')
  usage_page
end
Is there anything within this code that would cause this to randomly pop up? It only happens with random search queries.
Link to GitHub
UPDATE:
I've discovered that Google's advertisement services link has the same extension in its URL as the one giving me problems. However, this doesn't explain why I'm getting this link, or why I can't seem to skip over it.
urls = "settings/ads/preferences?hl=ru"
if urls =~ /settings\/ads\/preferences\?hl=[a-z]{2}/
  p "I'm skipped"
end
=> "I'm skipped"

What SHOULD I be seeing to know this code has done what it was supposed to?

I am doing the EventManager tutorial from Jumpstart Labs. Originally I could not get my .rb file to read a .erb file, and I think I may have solved that, but I am not sure, as I do not know what I SHOULD be seeing if everything is running correctly, and unfortunately the tutorial doesn't tell you. Here is my original question.
Now, after a simple change, I am no longer getting the error - but I am also not getting any indication that the code is working as expected. The tutorial says that this code should create a new directory called 'output' and store a copy of each 'thank you' letter as a file in that directory. But when I run it, all I see in the terminal is EventManager initialized., which tells me that my .rb is being read and (I think) that the .erb is finally being read... but I do not see any new directories or files in the file structure, nor any indication that anything was created - so I can't tell if it is actually doing anything.
I am kind of expecting to see some kind of message telling me the directory has been created, perhaps with a file path or something.
I have never done anything like this and I'm not sure what I should be seeing... can anyone tell me how I would know that this code is performing as expected? And if it is not, why not?
require "csv"
require "sunlight/congress"
require "erb"
Sunlight::Congress.api_key = "e179a6973728c4dd3fb1204283aaccb5"
def save_thank_you_letters(id, form_letter)
Dir.mkdir("output") unless Dir.exists? ("output")
filename = "output/thanks_#{id}.html"
File.open(filename, 'w') do |file|
file.puts form_letter
end
end
def legislators_by_zipcode(zipcode)
legislators = Sunlight::Congress::Legislator.by_zipcode(zipcode)
end
def clean_zipcode(zipcode)
zipcode.to_s.rjust(5,"0")[0..4]
end
puts "EventManager initialized."
contents = CSV.open "event_attendees.csv", headers: true, header_converters: :symbol
template_letter = File.read( "event_manager/form_letter.erb")
erb_template = ERB.new template_letter
contents.each do |row|
id = row[0]
name = row[:first_name]
zipcode = clean_zipcode(row[:zipcode])
legislators = legislators_by_zipcode(zipcode)
form_letter = erb_template.result(binding)
save_thank_you_letters(id, form_letter)
end
I've (ever so slightly) modified your save_thank_you_letters method to spit out some helpful information as it writes files:
def save_thank_you_letters(id, form_letter)
  Dir.mkdir("output") unless Dir.exists?("output")
  filename = "output/thanks_#{id}.html"
  File.open(filename, 'w') do |file|
    file.puts form_letter
    puts "Wrote ID: #{id} to #{filename}"
  end
end
The line puts "Wrote ID: #{id} to #{filename}" will print the ID and file path of the message it has written. You can place additional puts "Your text here..." throughout your Ruby logic to print more information to the console as you see fit.
Side note: in general, it's a super bad idea to post your personal API keys to any public forums. If this key is private/unique to you, delete it and request a new one. Anyone can now impersonate your account at Sunlight Labs.
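One more thing that often explains "nothing happened" in this tutorial (a guess, not something visible in your code): the output directory is created relative to the directory you run ruby from, not relative to the script file, so the letters may simply be somewhere you are not looking. Printing the absolute path removes the doubt:

# Shows exactly where the letters are being written, e.g. /Users/you/event_manager/output
puts "Output directory: #{File.expand_path('output')}"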

how do I save the parsed data to a file

I wonder how I can save the parsed data to a txt file. My script only saves the last parsed item. Do I need to add .each do? I'm kind of lost right now.
Here is my code - maybe somebody could explain how to save the parsed info with each entry on a new line:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.clearsearch.se/foretag/-/q_advokat/1/"
doc = Nokogiri::HTML(open(url))

doc.css(".gray-border-bottom").each do |item|
  title = item.css(".medium").text.strip
  phone = item.css(".grayborderwrapper > .bold").text.strip
  adress = item.css(".grayborder span").text.strip
  www = item.css(".click2www").map { |link| link['href'] }

  puts "#{title} ; \n"
  puts "#{phone} ; \n"
  puts "#{adress} ; \n"
  puts "#{www} ; \n\n\n"

  puts "Writing"
  company = "#{title}; #{phone}; #{adress}; #{www} \n\n"

  puts "saving"
  file = File.open("exporterad.txt", "w")
  file.write(company)
  file.close
  puts "done"
end

puts "done"
puts "done"
Calling File.open inside your loop truncates the file to zero length with each invocation. Instead, open the file outside your loop (using the block form):
File.open("exporterad.txt", "w") do |file|
doc.css(".gray-border-bottom").each do |item|
# ...
file.write(company)
# ...
end
end # <- file is closed automatically at the end of the block
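Filled in with the variables from your loop it might look like this (just a sketch; your selectors are unchanged, and www is joined so the array of links prints cleanly):

File.open("exporterad.txt", "w") do |file|
  doc.css(".gray-border-bottom").each do |item|
    title  = item.css(".medium").text.strip
    phone  = item.css(".grayborderwrapper > .bold").text.strip
    adress = item.css(".grayborder span").text.strip
    www    = item.css(".click2www").map { |link| link['href'] }

    # One company per line, written to the already-open file
    file.write("#{title}; #{phone}; #{adress}; #{www.join(', ')}\n")
  end
end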

open-uri Ruby Errors

I have the code:
require 'open-uri'

print "Enter a URL: "
add = gets
added = add.sub!(/http:\/\//, "")
puts "Info from: #{add}"

open("#{add}") do |f|
  img = f.read.scan(/<img/)
  img = img.length
  puts "\t#{img} images"
  f.close
end

open("#{add}") do |f|
  links = f.read.scan(/<a/)
  links = links.length
  puts "\t#{links} links"
  f.close
end

open("#{add}") do |f|
  div = f.read.scan(/<div/)
  div = div.length
  puts "\t#{div} div tags"
  f.close
end
(Yes, I know it isn't good code; please don't comment on that.)
When I run it and enter, say:
http://stackoverflow.com
I get the following error:
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:32:in `initialize': No such file or directory - http (Errno::ENOENT)
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:32:in `open'
Why does this error come up and how can I fix it?
The String.sub! method replaces the string in place, so add.sub!(/http:\/\//, "") changes the value of add in addition to setting added.
To use the open(name) method with URIs, the value of name must start with a URI scheme, like http://.
If you want to set added without modifying add, use the non-destructive sub instead:
added = add.sub(/http:\/\//, "")
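For example, a version that keeps the scheme on the URL, strips the trailing newline from gets, and reads the page only once might look roughly like this:

require 'open-uri'

print "Enter a URL: "
add = gets.chomp                 # keep the http:// prefix; just drop the newline
puts "Info from: #{add}"

open(add) do |f|
  html = f.read
  puts "\t#{html.scan(/<img/).length} images"
  puts "\t#{html.scan(/<a/).length} links"
  puts "\t#{html.scan(/<div/).length} div tags"
end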
