I am parsing a PDF file online in order to extract text. Here are the two complete programs:
First
require 'open-uri'
require 'net/http'
require 'pdf/reader'

# Disable SSL certificate verification (a quick hack for this script)
module OpenSSL
  module SSL
    remove_const :VERIFY_PEER
  end
end
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

io = open('https://www.mtholyoke.edu/sites/default/files/registrar/bulletin/docs/dept_econ.pdf')
reader = PDF::Reader.new(io)
reader.pages.each do |page|
  iso = page.text
  $var = iso.scan(/Economics[\s\S]*Overview/)
  p $var
end
Second
require 'open-uri'
require 'net/http'
require 'pdf/reader'

module OpenSSL
  module SSL
    remove_const :VERIFY_PEER
  end
end
OpenSSL::SSL::VERIFY_PEER = OpenSSL::SSL::VERIFY_NONE

io = open('https://www.mtholyoke.edu/sites/default/files/registrar/bulletin/docs/dept_econ.pdf')
reader = PDF::Reader.new(io)
reader.pages.each do |page|
  iso = page.text
  $var = iso.scan(/Economics[\s\S]*Overview/)
end
p $var
It appears that when I put p $var after the end, I get a truncated result, unlike with the first code. Why does putting p $var after end give a different result from putting it before?
In my web app I do need to put it after the end and still get the same result as the first code. How can I do that?
Inside the loop, p $var runs once per page, so each page's matches are printed as they are found; after the loop, $var only holds whatever the last iteration assigned to it, which is why the result looks truncated. Collect the results as you go instead:
tmp = reader.pages.map { |p| p.text.scan(/Economics[\s\S]*Overview/) }
tmp now contains a collection of all the scan results.
puts tmp.join("\n")
will print them all out with newlines between each match.
Although won't that just print a wad of "Economics Overview"?
If you want to collect the pages themselves, that's different code.
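If the goal is to keep a single p after the loop and still see everything the first program printed, a minimal sketch (same reader and regex as above) is to flatten the per-page matches into one array and print it once:
results = reader.pages.flat_map { |page| page.text.scan(/Economics[\s\S]*Overview/) }
p results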
I am trying to scrape all the email addresses on a given site using a single-file Ruby script. At the bottom of the file I have a hard-coded test case using a URL that has an email address listed on that specific page (so it should find an email address on the first iteration of the first loop).
For some reason, my regex does not seem to be matching:
# get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'

class GetEmails
  def initialize
    @urlCounter, @anemoneCounter = 0
    $allUrls, $emailUrls, $emails = []
  end

  def has_email?(listingUrl)
    hasListing = false
    Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
        body_text = page.body.to_s
        matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
        if matchOrNil != nil
          $emailUrls[$anemoneCounter] = listingUrl
          $emails[$anemoneCounter] = body_text.match
          $anemoneCounter += 1
          hasListing = true
        else
        end
      end
    end
    return hasListing
  end
end

emailGrab = GetEmails.new()
emailGrab.has_email?("http://genuinestoragesheds.com/contact/")
puts $emails[0]
So here is the working version of the code. It uses a single regex to find a string containing an email and three more to clean it up.
# get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'

class GetEmails
  def initialize
    @urlCounter = 0
    $anemoneCounter = 0
    $allUrls = []
    $emailUrls = []
    $emails = []
  end

  # Strip "href=", "mailto:" and the surrounding quotes from the matched string.
  def email_clean(email)
    email = email.gsub(/(\w+=)/, "")
    email = email.gsub(/(\w+:)/, "")
    email = email.gsub(/\A"|"\Z/, '') # gsub, not gsub!, so a string without quotes isn't turned into nil
    return email
  end

  def has_email?(listingUrl)
    hasListing = false
    Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
        body_text = page.body.to_s
        # matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
        matchOrNil = body_text.match(/[^@\s]+@[^@\s]+/)
        if matchOrNil != nil
          $emailUrls[$anemoneCounter] = listingUrl
          $emails[$anemoneCounter] = matchOrNil
          $anemoneCounter += 1
          hasListing = true
        end
      end
    end
    return hasListing
  end
end

emailGrab = GetEmails.new()
found_email = "href=\"mailto:genuinestoragesheds@gmail.com\""
puts emailGrab.email_clean(found_email)
\A and \z in your regex anchor the match to the beginning and end of the string respectively. Obviously that webpage contains more than just an email string, or you wouldn't need to do the regex test at all.
You can simplify it to just /[^@\s]+@[^@\s]+/, but you would still need to clean up the string to extract the email.
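Here is a minimal sketch (with a made-up snippet) of the difference; the anchored version only matches when the entire string is the email-like token:
snippet = 'href="mailto:someone@example.com"'
snippet.match(/\A[^@\s]+@[^@\s]+\z/)               # matches, because the whole string contains no spaces
"Contact: #{snippet}".match(/\A[^@\s]+@[^@\s]+\z/) # => nil; any surrounding text breaks \A..\z
"Contact: #{snippet}".match(/[^@\s]+@[^@\s]+/)     # matches 'href="mailto:someone@example.com"', which still needs cleanup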
I'm new to Ruby and am using Nokogiri to parse HTML web pages. An error is thrown in a function when it reaches this line:
currentPage = Nokogiri::HTML(open(url))
I have verified the function's inputs; url is a string containing a web address. The line I mentioned works exactly as intended when used outside of the function, but not inside it. When execution reaches that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function containing the problematic line is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'

# Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore(START_URL)
Ruby has no ++ operator. CRAWLED_PAGES_COUNTER++ is parsed as CRAWLED_PAGES_COUNTER + +<next expression>, so the unary +@ method ends up being called on the Nokogiri document built on the following line, which is exactly the NoMethodError you see. Constants also shouldn't be reassigned, so the settings become globals and the increment becomes += 1:
require 'nokogiri'
require 'open-uri'

# Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5

# Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER += 1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do |link|
    puts link
    explore(link)
  end
end

def eval_page(page)
  puts page.title
end

# Start Crawling
explore($START_URL)
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'

BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds

urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new

until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)

  puts "Scanning: #{this_uri}"

  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri

  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end

  visited_hosts << this_uri.host

  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }

  last_host = URL_FORMAT % [:scheme, :host, :port].map { |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
- Works on http://example.com. That site is designed and designated for experimenting.
- Checks whether a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries, or by queries that are not in a consistent order (see the sketch right after this list).
- Checks whether a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
- Checks whether a page ends with ".htm", ".html" or has no extension; anything else is ignored.
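For example, a hedged sketch of one way to make that visited check less naive (the helper name is made up) is to sort the query parameters into a stable order before comparing URLs:
require 'uri'

# Hypothetical helper: rewrite a URL so its query parameters are in a stable order.
def normalized(url)
  uri = URI.parse(url)
  uri.query = URI.encode_www_form(URI.decode_www_form(uri.query).sort) if uri.query
  uri.to_s
end

normalized('http://example.com/page?b=2&a=1') # => "http://example.com/page?a=1&b=2"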
The actual code to write an industrial-strength spider is much more involved. Robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages via HTTP timeouts or JavaScript redirects is a fun task, and dealing with malformed pages is a challenge...
I have a working program that searches Google using Mechanize; however, when the program searches Google it also pulls sites that look something like http://webcache.googleusercontent.com/.
I would like to reject those sites so they aren't stored in the file. All the sites' URLs are structured differently.
Source code:
require 'mechanize'

PATH = Dir.pwd
SEARCH = "test"

def info(input)
  puts "[INFO]#{input}"
end

def get_urls
  info("Searching for sites.")
  agent = Mechanize.new
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&})
      urls_to_log = str_list[1]
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }
    end
  end
  info("Sites dumped into #{PATH}/temp/sites.txt")
end

get_urls
Text file:
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk
It works now. I had an issue with success('log'); I don't know why, so I commented it out.
str_list = str.split(%r{=|&})
urls_to_log = str_list[1]
next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
# success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts("#{urls_to_log}") }
There are well-tested wheels for tearing URLs apart into their component parts, so use them. Ruby comes with URI, which makes it easy to extract the host, path or query:
require 'uri'
URL = 'http://foo.com/a/b/c?d=1'
URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
Ruby's Enumerable module includes reject and select which make it easy to loop over an array or enumerable object and reject or select elements from it:
(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]
Using all that you could check the host of a URL for sub-strings and reject any you don't want:
require 'uri'
%w[
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject{ |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]
Using these methods and techniques you can reject or select from an input file, or just peek into single URLs and choose to ignore or honor them.
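Putting that together with the question's script, a minimal sketch (the sample URL and output path are just illustrative) filters on the host before writing:
require 'uri'

PATH = Dir.pwd
urls_to_log = "http://webcache.googleusercontent.com/search%3Fhl%3Den%26q%3Dcache:M47_v0xF3m8J" # example value

host = URI.parse(urls_to_log).host
unless host[/googleusercontent\.com$/]
  File.open("#{PATH}/temp/sites.txt", "a+") { |s| s.puts(urls_to_log) }
end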
Is there a 'better/alternative/more readable' solution to http://www.pythonchallenge.com/pc/def/linkedlist.php than the ruby code below?
I can understand it but it just doesn't seem very clear to me...
#!/usr/bin/env ruby -w
# encoding: utf-8
require 'net/http'

Net::HTTP.start 'www.pythonchallenge.com' do |http|
  nothing = 12345
  nothing = case http.get("/pc/def/linkedlist.php?nothing=#{nothing}").body
            when /(\d+)$/ then $1.to_i
            when /by two/ then nothing / 2
            when /\.html/ then puts $`
            end while nothing
end
It was OK, but let's try to make it a little more readable:
#!/usr/bin/env ruby -w
# encoding: utf-8
require 'net/http'

Net::HTTP.start 'www.pythonchallenge.com' do |http|
  next_one = 12345
  while next_one
    response = http.get("/pc/def/linkedlist.php?nothing=#{next_one}").body
    case response
    when /Divide by two and keep going./ then
      next_one = next_one / 2
    when /and the next nothing is (\d+)/ then
      next_one = $1.to_i
    else
      puts "Solution url: www.pythonchallenge.com/pc/def/linkedlist.php?nothing=#{next_one}"
      next_one = false
    end
  end
end
Can anybody help me to:
- get the file size before I start downloading
- display how much % has already been downloaded
require 'net/http'
require 'uri'

url = "http://www.onalllevels.com/2009-12-02TheYangShow_Squidoo_Part 1.flv"
url_base = url.split('/')[2]
url_path = '/' + url.split('/')[3..-1].join('/')

Net::HTTP.start(url_base) do |http|
  resp = http.get(URI.escape(url_path))
  open("test.file", "wb") do |file|
    file.write(resp.body)
  end
end
puts "Done."
Use the request_head method, like this:
response = http.request_head('http://www.example.com/remote-file.ext')
file_size = response['content-length']
The file_size will be in bytes.
Follow these two links for more info.
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M000695
http://curl.haxx.se/mail/archive-2002-07/0070.html
So I made it work, even with the progress bar:
require 'net/http'
require 'uri'
require 'progressbar'

url = "url with some file"
url_base = url.split('/')[2]
url_path = '/' + url.split('/')[3..-1].join('/')
@counter = 0

Net::HTTP.start(url_base) do |http|
  response = http.request_head(URI.escape(url_path))
  pbar = ProgressBar.new("file name:", response['content-length'].to_i)
  pbar.format_arguments = [:title, :percentage, :bar, :stat_for_file_transfer]
  File.open("test.file", 'wb') do |f| # binary mode so the download isn't mangled on Windows
    http.get(URI.escape(url_path)) do |str|
      f.write str
      @counter += str.length
      pbar.set(@counter)
    end
  end
  pbar.finish # pbar is local to this block, so finish it here
end
puts "Done."
The file size is available in the HTTP Content-Length response header. If it is not present, you can't do anything. To calculate the percentage, just do the primary-school math: (part / total) * 100.
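Here is a minimal sketch of that math using only the standard library (the URL is a placeholder, assuming the server sends Content-Length):
require 'net/http'
require 'uri'

uri = URI('http://www.example.com/remote-file.ext') # placeholder URL
Net::HTTP.start(uri.host, uri.port) do |http|
  total = http.request_head(uri.path)['content-length'].to_i
  downloaded = 0
  File.open('remote-file.ext', 'wb') do |file|
    http.get(uri.path) do |chunk| # the block form yields the body in fragments
      file.write(chunk)
      downloaded += chunk.length
      printf("\r%.1f%% downloaded", downloaded * 100.0 / total) if total > 0
    end
  end
end
puts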
Here is the full code to get the file details before downloading:
require 'net/http'

response = nil
uri = URI('http://hero.com/abc.mp4')
Net::HTTP.start(uri.host, uri.port) do |http|
  response = http.head(uri)
end
response.header.each_header { |key, value| puts "#{key} = #{value}" }