Rejecting info from being stored in a file - ruby

I have a working program that searches Google using Mechanize; however, when the program searches Google it also pulls sites that look something like http://webcache.googleusercontent.com/.
I would like to reject that site from being stored in the file. All the sites' URLs are structured differently.
Source code:
require 'mechanize'

PATH = Dir.pwd
SEARCH = "test"

def info(input)
  puts "[INFO]#{input}"
end

def get_urls
  info("Searching for sites.")
  agent = Mechanize.new
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls_to_log = str_list[1]
      # `success` is presumably a logging helper defined alongside `info`
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }
    end
  end
  info("Sites dumped into #{PATH}/temp/sites.txt")
end

get_urls
Text file:
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk

It works now. I had an issue with success('log'); I don't know why, so I commented it out.
str_list = str.split(%r{=|&})
next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
urls_to_log = str_list[1]
# success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }

There are well-tested wheels for tearing URLs apart into their component parts, so use them. Ruby comes with URI, which lets us easily extract the host, path, or query:
require 'uri'
URL = 'http://foo.com/a/b/c?d=1'
URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
Ruby's Enumerable module includes reject and select which make it easy to loop over an array or enumerable object and reject or select elements from it:
(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]
Using all that you could check the host of a URL for sub-strings and reject any you don't want:
require 'uri'

%w[
  http://www.speedtest.net/
  http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject { |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]
Using these methods and techniques you can reject or select from an input file, or just peek into single URLs and choose to ignore or honor them.
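Applied to the question's loop, a rough sketch (the keep_url? helper is hypothetical, not part of the original script) could filter each extracted URL before writing it to sites.txt:
require 'uri'

# Hypothetical helper: drop any link whose host is googleusercontent.com
# before logging it.
def keep_url?(url)
  host = URI.parse(url).host
  host && !host.end_with?('googleusercontent.com')
rescue URI::InvalidURIError
  false # malformed links are rejected as well
end

keep_url?('http://www.speedtest.net/')                    # => true
keep_url?('http://webcache.googleusercontent.com/search') # => false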

Related

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse html webpages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function; url is a string with a web address. The line I previously mentioned works exactly as intended when used outside of the function, but not inside. When it gets to that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'

# Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore(START_URL)
Ruby has no ++ operator, and reassigning constants is not what you want either; switching to global variables and incrementing with += 1 fixes it:
require 'nokogiri'
require 'open-uri'

# Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER += 1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore($START_URL)
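As for why the error mentions the Nokogiri document: the trailing + of CRAWLED_PAGES_COUNTER++ makes the parser continue onto the next line, so the expression becomes CRAWLED_PAGES_COUNTER + +(currentPage = ...), and the unary +@ method gets called on the document. A minimal, self-contained illustration:
counter = 0
thing = Object.new

begin
  # These two lines parse as `counter + (+thing)`: the second `+` has no
  # operand on its line, so parsing continues to the next line and unary +@
  # is called on `thing`, which does not define it.
  counter++
  thing
rescue NoMethodError => e
  puts e.message # the message names the missing unary +@ method
end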
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"
  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end
  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map { |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries, or by queries that are not in a consistent order (see the normalization sketch after this list).
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
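For example, the naive visited-page check could be hardened by normalizing the query string before comparing; a rough, hypothetical helper (not part of the spider above):
require 'uri'

# Sort the query parameters so the same page reached via differently
# ordered queries compares as equal.
def normalized(url)
  uri = URI.parse(url)
  uri.query = URI.encode_www_form(URI.decode_www_form(uri.query).sort) if uri.query
  uri.to_s
end

normalized('http://example.com/page?b=2&a=1') # => "http://example.com/page?a=1&b=2"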
The actual code to write an industrial-strength spider is much more involved. Robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages either via HTTP timeouts or JavaScript redirects is a fun task, and dealing with malformed pages is a challenge....

/settings/ads/ Keeps popping up while scraping Google

I have a program that scrapes Google; it's an open-source vulnerability scraper that uses Mechanize to search Google. It uses a random search query provided in a text file to decide what to search for.
I'll post the main file and a link to the Git repository due to the size of the program.
Anyways, I have this program that is used to scrape for sites; however, while it is scraping, every now and then it comes across a 'URL' (I say that lightly) that looks like this:
[17:05:02 INFO]I'll run in default mode!
[17:05:02 INFO]I'm searching for possible SQL vulnerable sites, using search query inurl:/main.php?f1=
[17:05:04 SUCCESS]Site found: http://forix.autosport.com/main.php?l=0&c=1
[17:05:05 SUCCESS]Site found: https://zweeler.com/formula1/FantasyFormula12016/main.php?ref=103
[17:05:06 SUCCESS]Site found: https://en.zweeler.com/formula1/FantasyFormula1YearGame2015/main.php
[17:05:07 SUCCESS]Site found: http://modelcargo.com/main.php?mod=sambachoose&dep=samba
[17:05:08 SUCCESS]Site found: http://www.ukdirt.co.uk/main.php?P=rules&f=8
[17:05:09 SUCCESS]Site found: http://www.ukdirt.co.uk/main.php?P=tracks&g=2&d=2&m=0
[17:05:11 SUCCESS]Site found: http://zoohoo.sk/redir.php?q=v%FDsledok&url=http%3A%2F%2Flivescore.sk%2Fmain.php%3Flang%3Dsk
[17:05:12 SUCCESS]Site found: http://www.chemical-plus.com/main.php?f1=pearl_pigment.htm
[17:05:13 SUCCESS]Site found: http://www.fantasyf1.co/main.php
[17:05:14 SUCCESS]Site found: http://www.escritores.cl/base.php?f1=escritores/main.php
[17:05:15 SUCCESS]Site found: /settings/ads/preferences?hl=en #<= Right here
When this shows up, it completely crashes the program. I've tried doing the following:
next if urls == '/settings/ads/preferences?hl=en'
next if urls =~ /preferences?hl=en/
next if urls.split('/')[2] == 'ads/preferences?hl=en'
However, it keeps popping up. I should also mention that the last five characters depend on your location; so far I've seen:
hl=en
hl=ru
hl=ia
Does anybody have any idea what this is? I've done some research and literally can't find anything on it. Any help with this would be fantastic.
Main source:
#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'mechanize'
require 'nokogiri'
require 'rest-client'
require 'timeout'
require 'uri'
require 'fileutils'
require 'colored'
require 'yaml'
require 'date'
require 'optparse'
require 'tempfile'
require 'socket'
require 'net/http'

require_relative 'lib/modules/format.rb'
require_relative 'lib/modules/credits.rb'
require_relative 'lib/modules/legal.rb'
require_relative 'lib/modules/spider.rb'
require_relative 'lib/modules/copy.rb'
require_relative 'lib/modules/site_info.rb'

include Format
include Credits
include Legal
include Whitewidow
include Copy
include SiteInfo

PATH = Dir.pwd
VERSION = Whitewidow.version
SEARCH = File.readlines("#{PATH}/lib/search_query.txt").sample
info = YAML.load_file("#{PATH}/lib/rand-agents.yaml")
@user_agent = info['user_agents'][info.keys.sample]
OPTIONS = {}

def usage_page
  Format.usage("You can run me with the following flags: #{File.basename(__FILE__)} -[d|e|h] -[f] <path/to/file/if/any>")
  exit
end
def examples_page
  Format.usage('This is my examples page, I\'ll show you a few examples of how to get me to do what you want.')
  Format.usage('Running me with a file: whitewidow.rb -f <path/to/file> keep the file inside of one of my directories.')
  Format.usage('Running me default, if you don\'t want to use a file, because you don\'t think I can handle it, or for whatever reason, you can run me default by passing the Default flag: whitewidow.rb -d this will allow me to scrape Google for some SQL vuln sites, no guarantees though!')
  Format.usage('Running me with my Help flag will show you all options an explanation of what they do and how to use them')
  Format.usage('Running me without a flag will show you the usage page. Not descriptive at all but gets the point across')
end

OptionParser.new do |opt|
  opt.on('-f FILE', '--file FILE', 'Pass a file name to me, remember to drop the first slash. /tmp/txt.txt <= INCORRECT tmp/text.txt <= CORRECT') { |o| OPTIONS[:file] = o }
  opt.on('-d', '--default', 'Run me in default mode, this will allow me to scrape Google using my built in search queries.') { |o| OPTIONS[:default] = o }
  opt.on('-e', '--example', 'Shows my example page, gives you some pointers on how this works.') { |o| OPTIONS[:example] = o }
end.parse!

def page(site)
  Nokogiri::HTML(RestClient.get(site))
end

def parse(site, tag, i)
  parsing = page(site)
  parsing.css(tag)[i].to_s
end

def format_file
  Format.info('Writing to temporary file..')
  if File.exists?(OPTIONS[:file])
    file = Tempfile.new('file')
    IO.read(OPTIONS[:file]).each_line do |s|
      File.open(file, 'a+') { |format| format.puts(s) unless s.chomp.empty? }
    end
    IO.read(file).each_line do |file|
      File.open("#{PATH}/tmp/#sites.txt", 'a+') { |line| line.puts(file) }
    end
    file.unlink
    Format.info("File: #{OPTIONS[:file]}, has been formatted and saved as #sites.txt in the tmp directory.")
  else
    puts <<-_END_
Hey now my friend, I know you're eager, I am also, but that file #{OPTIONS[:file]}
either doesn't exist, or it's not in the directory you say it's in..
I'm gonna need you to go find that file, move it to the correct directory and then
run me again.
Don't worry I'll wait!
    _END_
    .yellow.bold
  end
end
def get_urls
  Format.info("I'll run in default mode!")
  Format.info("I'm searching for possible SQL vulnerable sites, using search query #{SEARCH}")
  agent = Mechanize.new
  agent.user_agent = @user_agent
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls = str_list[1]
      next if urls.split('/')[2].start_with? 'stackoverflow.com', 'github.com', 'www.sa-k.net', 'yoursearch.me', 'search1.speedbit.com', 'duckfm.net', 'search.clearch.org', 'webcache.googleusercontent.com'
      next if urls == '/settings/ads/preferences?hl=en' #<= ADD HERE REMEMBER A COMMA =>
      urls_to_log = URI.decode(urls)
      Format.success("Site found: #{urls_to_log}")
      sleep(1)
      sql_syntax = ["'", "`", "--", ";"].each do |sql|
        File.open("#{PATH}/tmp/SQL_sites_to_check.txt", 'a+') { |s| s.puts("#{urls_to_log}#{sql}") }
      end
    end
  end
  Format.info("I've dumped possible vulnerable sites into #{PATH}/tmp/SQL_sites_to_check.txt")
end
def vulnerability_check
  case
  when OPTIONS[:default]
    file_to_read = "tmp/SQL_sites_to_check.txt"
  when OPTIONS[:file]
    Format.info("Let's check out this file real quick like..")
    file_to_read = "tmp/#sites.txt"
  end
  Format.info('Forcing encoding to UTF-8') unless OPTIONS[:file]
  IO.read("#{PATH}/#{file_to_read}").each_line do |vuln|
    begin
      Format.info("Parsing page for SQL syntax error: #{vuln.chomp}")
      Timeout::timeout(10) do
        vulns = vuln.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
        begin
          if parse("#{vulns.chomp}'", 'html', 0)[/You have an error in your SQL syntax/]
            Format.site_found(vulns.chomp)
            File.open("#{PATH}/tmp/SQL_VULN.txt", "a+") { |s| s.puts(vulns) }
            sleep(1)
          else
            Format.warning("URL: #{vulns.chomp} is not vulnerable, dumped to non_exploitable.txt")
            File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vulns) }
            sleep(1)
          end
        rescue Timeout::Error, OpenSSL::SSL::SSLError
          Format.warning("URL: #{vulns.chomp} failed to load dumped to non_exploitable.txt")
          File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vulns) }
          next
          sleep(1)
        end
      end
    rescue RestClient::ResourceNotFound, RestClient::InternalServerError, RestClient::RequestTimeout, RestClient::Gone, RestClient::SSLCertificateNotVerified, RestClient::Forbidden, OpenSSL::SSL::SSLError, Errno::ECONNREFUSED, URI::InvalidURIError, Errno::ECONNRESET, Timeout::Error, OpenSSL::SSL::SSLError, Zlib::GzipFile::Error, RestClient::MultipleChoices, RestClient::Unauthorized, SocketError, RestClient::BadRequest, RestClient::ServerBrokeConnection, RestClient::MaxRedirectsReached => e
      Format.err("URL: #{vuln.chomp} failed due to an error while connecting, URL dumped to non_exploitable.txt")
      File.open("#{PATH}/log/non_exploitable.txt", "a+") { |s| s.puts(vuln) }
      next
    end
  end
end
case
when OPTIONS[:default]
  begin
    Whitewidow.spider
    sleep(1)
    Credits.credits
    sleep(1)
    Legal.legal
    get_urls
    vulnerability_check unless File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
    Format.warn("No sites found for search query: #{SEARCH}. Logging into error_log.LOG. Create a issue regarding this.") if File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
    File.open("#{PATH}/log/error_log.LOG", 'a+') { |s| s.puts("No sites found with search query #{SEARCH}") } if File.size("#{PATH}/tmp/SQL_sites_to_check.txt") == 0
    File.truncate("#{PATH}/tmp/SQL_sites_to_check.txt", 0)
    Format.info("I'm truncating SQL_sites_to_check file back to #{File.size("#{PATH}/tmp/SQL_sites_to_check.txt")}")
    Copy.file("#{PATH}/tmp/SQL_VULN.txt", "#{PATH}/log/SQL_VULN.LOG")
    File.truncate("#{PATH}/tmp/SQL_VULN.txt", 0)
    Format.info("I've run all my tests and queries, and logged all important information into #{PATH}/log/SQL_VULN.LOG")
  rescue Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError, RestClient::BadGateway => e
    d = DateTime.now
    Format.fatal("Well this is pretty crappy.. I seem to have encountered a #{e} error. I'm gonna take the safe road and quit scanning before I break something. You can either try again, or manually delete the URL that caused the error.")
    File.open("#{PATH}/log/error_log.LOG", 'a+') { |error| error.puts("[#{d.month}-#{d.day}-#{d.year} :: #{Time.now.strftime("%T")}]#{e}") }
    Format.info("I'll log the error inside of #{PATH}/log/error_log.LOG for further analysis.")
  end
when OPTIONS[:file]
  begin
    Whitewidow.spider
    sleep(1)
    Credits.credits
    sleep(1)
    Legal.legal
    Format.info('Formatting file')
    format_file
    vulnerability_check
    File.truncate("#{PATH}/tmp/SQL_sites_to_check.txt", 0)
    Format.info("I'm truncating SQL_sites_to_check file back to #{File.size("#{PATH}/tmp/SQL_sites_to_check.txt")}")
    Copy.file("#{PATH}/tmp/SQL_VULN.txt", "#{PATH}/log/SQL_VULN.LOG")
    File.truncate("#{PATH}/tmp/SQL_VULN.txt", 0)
    Format.info("I've run all my tests and queries, and logged all important information into #{PATH}/log/SQL_VULN.LOG") unless File.size("#{PATH}/log/SQL_VULN.LOG") == 0
  rescue Mechanize::ResponseCodeError, RestClient::ServiceUnavailable, OpenSSL::SSL::SSLError, RestClient::BadGateway => e
    d = DateTime.now
    Format.fatal("Well this is pretty crappy.. I seem to have encountered a #{e} error. I'm gonna take the safe road and quit scanning before I break something. You can either try again, or manually delete the URL that caused the error.")
    File.open("#{PATH}/log/error_log.LOG", 'a+') { |error| error.puts("[#{d.month}-#{d.day}-#{d.year} :: #{Time.now.strftime("%T")}]#{e}") }
    Format.info("I'll log the error inside of #{PATH}/log/error_log.LOG for further analysis.")
  end
when OPTIONS[:example]
  examples_page
else
  Format.warning('You failed to pass me a flag!')
  usage_page
end
Is there anything within this code that would cause this to randomly pop up? It only happens with random search queries.
Link to GitHub
UPDATE:
I've discovered that Google's advertisement services link has the same extension in its URL as the one giving me problems. However, this doesn't explain why I'm getting this link, and why I can't seem to skip over it.
urls = "settings/ads/preferences?hl=ru"
if urls =~ /settings\/ads\/preferences\?hl=[a-z]{2}/
p "I'm skipped"
end
=> "I'm skipped"

Global variable before and after `end`

I am parsing a PDF file online in order to extract text. Here are the two complete programs:
First
require 'open-uri'
require "net/http"
require 'pdf/reader'

module OpenSSL
  module SSL
    remove_const :VERIFY_PEER
  end
end
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

io = open('https://www.mtholyoke.edu/sites/default/files/registrar/bulletin/docs/dept_econ.pdf')
reader = PDF::Reader.new(io)

reader.pages.each do |page|
  iso = page.text
  $var = iso.scan(/Economics[\s\S]*Overview/)
  p $var
end
Second
require 'open-uri'
require "net/http"
require 'pdf/reader'

module OpenSSL
  module SSL
    remove_const :VERIFY_PEER
  end
end
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

io = open('https://www.mtholyoke.edu/sites/default/files/registrar/bulletin/docs/dept_econ.pdf')
reader = PDF::Reader.new(io)

reader.pages.each do |page|
  iso = page.text
  $var = iso.scan(/Economics[\s\S]*Overview/)
end
p $var
It appears that when I use p $var after end, the result is truncated, unlike with the first code. Why does putting p $var after end give a different result from putting it before?
In my web app, I need to put it after the end and get the same result as with the first code. How can I do so?
In the second version $var is reassigned on every pass through the loop, so once the loop finishes it only holds whatever the last page matched; in the first version p $var runs inside the loop, so every page's matches get printed. To keep everything, collect the results instead:
tmp = reader.pages.map { |p| p.text.scan(/Economics[\s\S]*Overview/) }
tmp now contains a collection of all the scan results.
puts tmp.join("\n")
will print them all out with newlines between each match.
Although, won't that just print a wad of "Economics Overview"?
If you want to collect the pages themselves, that's different code.
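A rough sketch of that, assuming you want the page objects whose text contains a match rather than the matches themselves:
# Collect the PDF::Reader page objects whose text matches.
matching_pages = reader.pages.select { |page| page.text =~ /Economics[\s\S]*Overview/ }
puts "#{matching_pages.size} page(s) matched"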

Why doesn't my web-crawling method find all the links?

I'm trying to create a simple web-crawler, so I wrote this:
(The method get_links takes a parent link from which we will seek.)
require 'nokogiri'
require 'open-uri'

def get_links(link)
  link = "http://#{link}"
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  array = hrefs.select { |i| i[0] == "/" }
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
end
(The method search_links takes an array from the get_links method and searches that array.)
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
end
This method finds most of the links from the website, but not all.
What did I do wrong? Which algorithm should I use?
Some comments about your code:
def get_links(link)
  link = "http://#{link}"
  # You're assuming the protocol is always http.
  # This isn't the only protocol used on the web.
  doc = Nokogiri::HTML(open(link))
  links = doc.css('a')
  hrefs = links.map { |link| link.attribute('href').to_s }.uniq.delete_if { |href| href.empty? }
  # You can write these two lines more compactly as
  # hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
  array = hrefs.select { |i| i[0] == "/" }
  # I guess you want to handle URLs that are relative to the host.
  # However, URLs relative to the protocol (starting with '//')
  # will also be selected by this condition.
  host = URI.parse(link).host
  links_list = array.map { |a| "#{host}#{a}" }
  # The value assigned to links_list will implicitly be returned.
  # (The assignment itself is futile; the right-hand part alone would
  # suffice.) Because this builds on `array`, all absolute URLs will be
  # missing from the return value.
end
Explanation for
hrefs = doc.xpath('//a/@href').map(&:to_s).uniq.delete_if(&:empty?)
.xpath('//a/@href') uses the attribute syntax of XPath to directly get to the href attributes of a elements
.map(&:to_s) is an abbreviated notation for .map { |item| item.to_s }
.delete_if(&:empty?) uses the same abbreviated notation
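As a quick sanity check of those abbreviated blocks, with made-up values:
['/a', '/b', '/a', ''].map(&:to_s).uniq.delete_if(&:empty?)
# => ["/a", "/b"]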
And comments about the second function:
def search_links(urls)
  urls = get_links(link)
  urls.uniq.each do |url|
    begin
      links = get_links(url)
      compare = urls & links
      urls << links - compare
      urls.flatten!
      # How about using a Set instead of an Array and
      # thus have the collection provide uniqueness of
      # its items, so that you don't have to?
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end
  return urls
  # This function isn't recursive, it just calls `get_links` on two
  # 'levels'. Thus you search only two levels deep and return findings
  # from the first and second level combined. (Without the "zero'th"
  # level - the URL passed into `search_links`. Unless of course it
  # also occurred on the first or second level.)
  #
  # Is this what you intended?
end
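Picking up the Set suggestion from the comment above, a rough sketch of how that could look (this assumes the get_links method defined earlier and is not the answer's own code):
require 'set'

def search_links(start_url)
  to_visit = get_links(start_url)
  found = Set.new(to_visit)

  until to_visit.empty?
    url = to_visit.shift
    begin
      get_links(url).each do |link|
        # Set#add? returns nil when the element is already present,
        # so each URL is queued at most once.
        to_visit << link if found.add?(link)
      end
    rescue OpenURI::HTTPError
      warn "Skipping invalid link #{url}"
    end
  end

  found.to_a
end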
You should probably be using mechanize:
require 'mechanize'
agent = Mechanize.new
page = agent.get url
links = page.search('a[href]').map{|a| page.uri.merge(a[:href]).to_s}
# if you want to remove links with a different host (hyperlinks?)
links.reject!{|l| URI.parse(l).host != page.uri.host}
Otherwise you'll have trouble converting relative URLs to absolute ones properly.
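That relative-to-absolute conversion is what page.uri.merge handles above; a standalone illustration with plain URI and made-up URLs:
require 'uri'

base = URI.parse('http://example.com/dir/page.html')
base.merge('../other.html').to_s          # => "http://example.com/other.html"
base.merge('//cdn.example.com/x.js').to_s # => "http://cdn.example.com/x.js"
base.merge('?page=2').to_s                # => "http://example.com/dir/page.html?page=2"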

Scraping issue with Nokogiri

I am trying to write a simple script that will tell me when the next episode of x show will be released.
Here is what I have so far:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
puts doc.at_css('h1').text
airdate = doc.at_css('.highlight_date span , h1').text
date = /\W/.match(airdate)
puts date
When I run this, all it returns is:
Game of thrones
The CSS selector I use there gives a line like 'airdate is /xx/xx/xx'; however, I only want the date, which is why I used /\W/, although I could be completely wrong here.
So basically I want it to just print the show title and the date of the next episode.
You can do it as below:
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
# under season 4 there are currently 7 episodes, which may change later
doc.css('#season-4-eps > li').size # => 7
# collect season 4 episodes, then their titles and dates
doc.css('#season-4-eps > li').collect { |node| [node.css('.title').text, node.css('.date').text] }
# => [["Mockingbird", "5/18/14"],
# ["The Laws of God and Men", "5/11/14"],
# ["First of His Name", "5/4/14"],
# ["Oathkeeper", "4/27/14"],
# ["Breaker of Chains", "4/20/14"],
# ["The Lion and the Rose", "4/13/14"],
# ["Two Swords", "4/6/14"]]
Looking at the webpage again, I can see that it always opens with the latest season's data. Thus the above code can be modified as below:
# how many seasons are present
latest_season = doc.css(".filters > li[data-season]").size # => 4
# collect the latest season's episodes, then their titles and dates
doc.css("#season-#{latest_season}-eps > li").collect do |node|
  p [node.css('.title').text, node.css('.date').text]
end
# >> ["The Mountain and the Viper", "6/1/14"]
# >> ["Mockingbird", "5/18/14"]
# >> ["The Laws of God and Men", "5/11/14"]
# >> ["First of His Name", "5/4/14"]
# >> ["Oathkeeper", "4/27/14"]
# >> ["Breaker of Chains", "4/20/14"]
# >> ["The Lion and the Rose", "4/13/14"]
# >> ["Two Swords", "4/6/14"]
As per the comment, it seems the OP may be interested in getting the data out of the NEXT EPISODE box of the webpage. Here is a way to do that:
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
hash = {}
doc.css('div[class ~= next_episode] div.highlight_info').tap do |node|
  hash['date'] = node.css('p.highlight_date > span').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
  hash['title'] = node.css('div.highlight_name > a').text
end
hash # => {"date"=>"5/18/2014", "title"=>"Mockingbird"}
It's worth reading the documentation for tap {|x| ...} → obj:
Yields x to the block, and then returns x. The primary purpose of this method is to "tap into" a method chain, in order to perform operations on intermediate results within the chain.
and for str[regexp] → new_str or nil.
Also read about CSS selectors to understand how selectors work with the #css method.
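Tying it back to the goal of printing just the show title and the next episode's date, a small usage sketch that assumes the doc and hash built in the snippet above:
# Assumes `doc` and `hash` from the previous snippet.
show = doc.at_css('h1').text
puts "#{show}: next episode #{hash['title']} airs #{hash['date']}"
# e.g. Game of Thrones: next episode Mockingbird airs 5/18/2014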
