open-uri Ruby Errors

I have the code:
require 'open-uri'
print "Enter a URL: "
add = gets
added = add.sub!(/http:\/\//, "")
puts "Info from: #{add}"
open("#{add}") do |f|
img = f.read.scan(/<img/)
img = img.length
puts "\t#{img} images"
f.close
end
open("#{add}") do |f|
links = f.read.scan(/<a/)
links = links.length
puts "\t#{links} links"
f.close
end
open("#{add}") do |f|
div = f.read.scan(/<div/)
div = div.length
puts "\t#{div} div tags"
f.close
end
(Yes I know it isn't good code, don't comment about it please)
When I run it, and for the URL, I enter in, say:
http://stackoverflow.com
I get the following error:
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:32:in `initialize': No such file or directory - http (Errno::ENOENT)
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:32:in `open_uri_original_open'
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:32:in `open'
Why does this error come up and how can I fix it?

The String#sub! method replaces the match in place, so add.sub!(/http:\/\//, "") strips the scheme from add itself in addition to setting added.
To use the open(name) method with URIs, the value of name must start with a URI scheme, like http://. Without the scheme, open treats the string as a local file path, which is why you get Errno::ENOENT.
If you want to set added without modifying add, use the non-destructive sub instead:
added = add.sub(/http:\/\//, "")
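For reference, here is a minimal corrected sketch of the same approach: chomp the trailing newline from gets, keep the scheme on the URL passed to open, and read the page once instead of three times:
require 'open-uri'

print "Enter a URL: "
add = gets.chomp                                     # strip the trailing newline
add = "http://#{add}" unless add =~ %r{\Ahttps?://}  # make sure a scheme is present

puts "Info from: #{add}"
open(add) do |f|
  html = f.read                                      # read the page once and reuse it
  puts "\t#{html.scan(/<img/).length} images"
  puts "\t#{html.scan(/<a/).length} links"
  puts "\t#{html.scan(/<div/).length} div tags"
end                                                  # the block form closes the handle for you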

Related

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse HTML web pages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function; url is a string with a web address. The line I previously mentioned works exactly as intended when used outside of the function, but not inside it. When it gets to that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'
#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore(START_URL)
Ruby has no ++ operator. CRAWLED_PAGES_COUNTER++ followed by currentPage = ... on the next line is parsed as CRAWLED_PAGES_COUNTER + +currentPage, so Ruby tries to call the unary +@ method on the Nokogiri document, which is where the NoMethodError comes from. Constants also aren't meant to be reassigned; use variables and increment with += 1:
require 'nokogiri'
require 'open-uri'
#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
return
end
$CRAWLED_PAGES_COUNTER+=1
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//@href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore($START_URL)
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'
BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds
urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new
until urls.empty?
this_uri = URI.join(last_host, urls.shift)
next if visited_urls.include?(this_uri)
puts "Scanning: #{this_uri}"
doc = Nokogiri::HTML(this_uri.open)
visited_urls << this_uri
if visited_hosts.include?(this_uri.host)
puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
sleep SLEEP_TIME
end
visited_hosts << this_uri.host
urls += doc.search('[href]').map { |node|
node['href']
}.select { |url|
extension = File.extname(URI.parse(url).path)
extension[/\.html?$/] || extension.empty?
}
last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or by queries that are not in a consistent order; a normalization sketch follows below.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code to write an industrial-strength spider is much more involved: robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages via HTTP timeouts or JavaScript redirects is a fun task, and dealing with malformed pages is a challenge...
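As an illustration of the naive-check caveat above, one way to make the visited test sturdier is to normalize each URL before storing it in the set, e.g. by downcasing the host, dropping fragments, and sorting query parameters. The normalize_url helper below is a hypothetical sketch, not part of the original spider:
require 'uri'

# Hypothetical helper: canonicalize a URL so that equivalent variants
# (different query order, trailing fragment, mixed-case host) compare equal.
def normalize_url(url)
  uri = URI.parse(url)
  uri.host = uri.host.downcase if uri.host
  uri.fragment = nil
  uri.query = uri.query.split('&').sort.join('&') if uri.query
  uri.to_s
end

# Store the normalized form in the visited set instead of the raw URI:
# visited_urls << normalize_url(this_uri.to_s)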

Not extracting the full link using index

I'm trying to extract the first href link from a website. Just the full link alone.
I am expecting to get http://www.iana.org/domains/example as the output but instead I am getting just http://www.iana.org/domains/ex
require 'net/http'
source = Net::HTTP.get('www.example.org', '/index.html')
def findhref(page) #returns rest of the html after href
return page[page.index('href')..-1]
end
def findlink(page)
text = findhref(page)
firstquote = text.index('"') #first position of quote
secondquote = text[firstquote+1..-1].index('"') #2nd quote
puts text #for debugging
puts firstquote+1 #for debugging
puts secondquote #for debugging
return text[firstquote+1..secondquote]
end
print findlink(source)
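The truncation happens because secondquote is an index into the substring that starts after the first quote, so it is offset from the beginning of text by firstquote + 1; using it directly as the end of the range cuts the link short. A minimal fix that keeps your approach would be:
# secondquote is relative to text[firstquote+1..-1], so shift it back into place:
return text[firstquote + 1..firstquote + secondquote]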
I would suggest using Nokogiri for HTML parsing. The solution to your problem would be as simple as:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.example.org/index.html'))
first_anchor = doc.css('a').first
first_href = first_anchor['href']

Download a file only if it exists with ruby

I'm doing a scraper to download all the issues of The Exile available at http://exile.ru/archive/list.php?IBLOCK_ID=35&PARAMS=ISSUE.
So far, my code is like this:
require 'rubygems'
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"
for number in 120..290
numero = BASE_exile_URL + number.to_s + ".pdf"
puts "Downloading issue #{number}"
open(numero) { |f|
File.open("#{DATA_DIR}/#{number}.pdf",'w') do |file|
file.puts f.read
end
}
end
puts "done"
The thing is, a lot of the issue links are down, and since the code creates a PDF for every issue, a dead link leaves behind an empty PDF. How can I change the code so that it only creates and copies a file if the link exists?
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
url_template = "http://exile.ru/docs/pdf/issues/exile%d.pdf"
filename_template = "#{DATA_DIR}/%d.pdf"
(120..290).each do |number|
pdf_url = url_template % number
print "Downloading issue #{number}"
# Opening the URL downloads the remote file.
open(pdf_url) do |pdf_in|
if pdf_in.read(4) == '%PDF'
pdf_in.rewind
File.open(filename_template % number, 'wb') do |pdf_out| # binary mode so the PDF isn't mangled on Windows
pdf_out.write(pdf_in.read)
end
print " OK\n"
else
print " #{pdf_url} is not a PDF\n"
end
end
end
puts "done"
open(url) downloads the file and provides a handle to a local temp file. A PDF starts with '%PDF'. After reading the first 4 characters, if the file is a PDF, the file pointer has to be put back to the beginning to capture the whole file when writing a local copy.
You can use this code to check whether the PDF exists before downloading it:
require 'net/http'
def exist_the_pdf?(url_pdf)
  url = URI.parse(url_pdf)
  Net::HTTP.start(url.host, url.port) do |http|
    # Net::HTTP.start returns the block's value, so the method returns true/false
    http.request_head(url.path)['content-type'] == 'application/pdf'
  end
end
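For example, a sketch of how that check could be wired into the question's download loop (assuming open-uri and the DATA_DIR setup from the question are already in place):
(120..290).each do |number|
  pdf_url = "http://exile.ru/docs/pdf/issues/exile#{number}.pdf"
  unless exist_the_pdf?(pdf_url)
    puts "Issue #{number} is missing, skipping"
    next
  end
  puts "Downloading issue #{number}"
  open(pdf_url) do |f|
    File.open("#{DATA_DIR}/#{number}.pdf", 'wb') { |out| out.write(f.read) }
  end
end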
Try this:
require 'rubygems'
require 'open-uri'
DATA_DIR = "exile"
Dir.mkdir(DATA_DIR) unless File.exists?(DATA_DIR)
BASE_exile_URL = "http://exile.ru/docs/pdf/issues/exile"
for number in 120..290
numero = BASE_exile_URL + number.to_s + ".pdf"
open(numero) { |f|
content = f.read
if content.include? "Link is missing"
puts "Issue #{number} doesnt exists"
else
puts "Issue #{number} exists"
File.open("./#{number}.pdf",'w') do |file|
file.write(content)
end
end
}
end
puts "done"
The main thing I added is a check to see whether the page contains the string "Link is missing". I wanted to do it using HTTP status codes, but the server always gives back a 200, which is not best practice.
The thing to note is that with my code you still download the whole page just to look for that string, but I don't have a better way to fix that at the moment.

Finding path between pages on wiki doesn't work with larger paths

I am trying to make a program which takes an input Wikipedia link and clicks on the first link. The program will continue to run until it matches the second input. I will eventually add functionality to terminate the program when it hits a loop.
Right now my code works for examples with only a few links, such as Bee -> History, but it gives me an error for longer paths. Here is the code; I would appreciate any input. I just started studying Ruby yesterday and likely have mistakes.
require 'open-uri'
require 'nokogiri'
puts "Enter starting page (full URL not needed): "
page1 = gets.chomp
puts "Enter ending page (full URL not needed): "
page2 = gets.chomp
until page1 == page2 do
#open page
doc = Nokogiri::HTML(open("http://en.wikipedia.org/wiki/" + page1))
%w[.//table .//span .//sup .//i].map {|n| doc.xpath(n).map(&:remove) }
#find href in first p
fp = doc.css("p").first.search('a').map{ |a| a['href']}
#make page1 = the end of the url. ex. /wiki/link = link
page1 = fp.first[6,fp.first.length]
puts page1
end
Updated: Here is the error I am getting:
C:\Users\files>ruby 121.rb
Enter starting page (full URL not needed):
Cow
Enter ending page (full URL not needed):
Philosophy
Domestication
Latin_(language)
Classical_antiquity
History
121.rb:20:in `<main>': undefined method `length' for nil:NilClass (NoMethodError)
The error means fp.first is nil: for that page, the first <p> element (after your removals) contains no links, so calling length on nil raises the NoMethodError. Also, to solve your task, you can traverse all links on the page until you reach page2:
require 'open-uri'
require 'nokogiri'
puts "Enter starting page (full URL not needed): "
start_page = gets.chomp
puts "Enter ending page (full URL not needed): "
end_page = gets.chomp
pages = [start_page]
next_page = pages.first
until next_page == end_page or pages.empty? do
next_page = pages.pop
puts "Treat: #{next_page}"
doc = Nokogiri::HTML(open("http://en.wikipedia.org/wiki/" + next_page))
%w[.//table .//span .//sup .//i].map {|n| doc.xpath(n).map(&:remove) }
doc.css("p").each do |p|
p.search('a').each{ |a| pages.push a['href'][6, a['href'].length]}
end
end
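One caveat with that approach: it never records which pages it has already treated, so on a cyclic link graph it can revisit pages indefinitely. A hypothetical variation that keeps a Set of visited titles and searches breadth-first (a queue via shift instead of pop) might look like this:
require 'open-uri'
require 'nokogiri'
require 'set'

start_page = "Cow"          # hypothetical fixed inputs instead of gets, for the sketch
end_page   = "Philosophy"

visited = Set.new
queue   = [start_page]

until queue.empty?
  page = queue.shift                        # FIFO queue => breadth-first search
  next if visited.include?(page)
  visited << page
  break if page == end_page

  puts "Treat: #{page}"
  doc = Nokogiri::HTML(open("http://en.wikipedia.org/wiki/" + page))
  %w[.//table .//span .//sup .//i].each { |n| doc.xpath(n).each(&:remove) }
  doc.css("p a[href^='/wiki/']").each do |a|
    queue.push a['href'][6, a['href'].length]   # strip the leading "/wiki/"
  end
end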

Writing to a file then trying to open it again for parsing

I'm trying to save the XML feed of a Twitter user to a file and then read the file back in for parsing so I can print the info to the screen.
This is what I see when I try to run it:
Wrote to file #<File:0x000001019257c8>
Now parsing user info..
twitter_stats.rb:20:in `<main>': undefined method `read' for "keva161.txt":String (NoMethodError)
Here's my code...
require "open-uri"
require "rubygems"
require "crack"
twitter_url = "http://api.twitter.com/1/statuses/user_timeline.xml?cout=100&screen_name="
username = "keva161"
full_page = twitter_url + username
local_file = username + ".txt"
tweets = open(full_page).read
my_local_file = open(local_file, "w")
my_local_file.write(tweets)
puts "Wrote to file " + my_local_file.to_s
sleep(1)
puts "Now parsing user info.."
sleep(1)
parsed_xml = Crack::XML.parse(local_file.read)
tweets = parsed_xml["statuses"]
first_tweet = tweets[0]
user = first_tweets["user"]
puts user["screen_name"]
puts user ["name"]
puts users ["created_at"]
puts users ["statuses_count"]
You are calling read on local_file, which is the string containing the filename. You meant to type my_local_file.read, I guess, to use the IO object you got from open. (...or File.read local_file.)
Not that this is the best form: why are you writing to a temporary file anyhow? You have the data in memory, so just pass it directly.
If you do want to write to a local file, I recommend the block form of open:
open(local_file, 'w') do |fh|
fh.print ...
end
That way Ruby will take care of closing the file for you and all that.
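Putting that together, a minimal sketch of the corrected write-and-read-back portion (keeping the rest of the script as it is; the block form closes the file before you read it again):
# Write the feed using the block form so the file is closed automatically.
open(local_file, "w") do |fh|
  fh.write(tweets)
end
puts "Wrote to file #{local_file}"

# Read the file back by name and parse it.
parsed_xml = Crack::XML.parse(File.read(local_file))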
