Proper way of reading a page and saving it into a html file? - ruby

I have the following:
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'
Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
anemone.on_pages_like(/http:\/\/www.findbrowsenodes.com\/us\/.+\/[\d]*/) do | page |
doc = Nokogiri::HTML(open(page.url))
id = doc.at_css("#n_info #clipnode").text unless doc.at_css("#n_info #clipnode").nil?
File.open("#{node_id}.html", "wb") do |f|
f.write(open(page).read)
end
end
end
So I'm trying to save each URL as an HTML file with this:
File.open("#{id}.html", "wb") do |f|
f.write(open(page).read)
end
But I get this error:
alex@alex-K43U:~/rails/anemone$ ruby anemone.rb
/home/alex/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/open-uri.rb:35:in `open': can't convert Anemone::Page into String (TypeError)
from /home/alex/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/open-uri.rb:35:in `open'
from anemone.rb:27:in `block (3 levels) in <main>'
from anemone.rb:26:in `open'
from anemone.rb:26:in `block (2 levels) in <main>'
What's the right way of doing this?

There are several problems / confusions:
As the error says, the open method expects a String (i.e. the URL), but you're passing it an Anemone::Page object.
This object has a url method, which you already use when you build the Nokogiri document:
open(page.url)
You're already opening the page there, so you could reuse that result. But:
According to the docs (http://anemone.rubyforge.org/doc/classes/Anemone/Page.html), Anemone::Page has a body method that may already contain the content (I'm just guessing; I haven't used or tried that library). If that's the case, there's no need to call open at all.
As I see it, the following untested code may be more like what you're looking for:
doc = Nokogiri::HTML(page.body)
# [snip]
File.open("#{node_id}.html", "wb") do |f|
f.write(page.body)
end
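The file-writing step can be sketched on its own, without touching the network. This assumes, as the answer guesses, that page.body returns the fetched HTML as a String; the helper name save_page and its parameters are hypothetical and not part of Anemone.

```ruby
require 'fileutils'

# Hypothetical helper: writes an already-fetched HTML body to "<id>.html".
# Returns the path written, or nil when no usable id was extracted.
def save_page(body, id, dir = ".")
  return nil if id.nil? || id.strip.empty?

  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{id.strip}.html")
  # "wb" avoids newline translation, matching the mode used in the question.
  File.open(path, "wb") { |f| f.write(body) }
  path
end
```

Inside the on_pages_like block this would be called as save_page(page.body, id), with id taken from the Nokogiri lookup. Skipping pages without an id avoids writing a file named ".html".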

Related

efficiently processing large quantity of files in Ruby

I am writing a script and I need to traverse a file system, and return the SHA1 sum of the files.
The code I am using is this:
time ruby -r'digest/sha1' -r'find' -e 'Find.find("/") {|x| next unless File.file?(x) ; Digest::SHA1.hexdigest(File.read(x))}'
The problem is that I get this error message about 5 seconds into execution:
-e:1:in `read': failed to allocate memory (NoMemoryError)
from -e:1:in `open'
from -e:1:in `block in <main>'
from /usr/share/ruby/find.rb:43:in `block in find'
from /usr/share/ruby/find.rb:42:in `catch'
from /usr/share/ruby/find.rb:42:in `find'
from -e:1:in `<main>'
Why am I getting this error, and what is the "best practice" for handling a task like this?
Help appreciated.
It doesn't seem to be well documented (or at least, I'm not looking in the right place), but the Digest library provides a way of hashing files by reading them in chunks and updating the digest as it goes, whereas File.read loads the whole file into memory.
The working code would be:
require 'digest/sha1'
require 'find'
begin
Find.find("/") do |file|
next unless File.file?(file)
puts "#{Digest::SHA1.file(file)} #{file}"
end
rescue => e
puts e
end
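A variation on the code above, as a minimal sketch: it collects the digests in a Hash and moves the rescue inside the loop, so that one unreadable file does not abort the whole traversal. The method name sha1_for_tree is hypothetical.

```ruby
require 'digest/sha1'
require 'find'

# Returns a Hash of path => SHA1 hex digest for every regular file under root.
# Files that cannot be read are skipped and reported on stderr.
def sha1_for_tree(root)
  digests = {}
  Find.find(root) do |path|
    next unless File.file?(path)
    begin
      # Digest::SHA1.file reads the file in chunks rather than all at once.
      digests[path] = Digest::SHA1.file(path).hexdigest
    rescue SystemCallError => e
      warn "skipped #{path}: #{e.message}"
    end
  end
  digests
end
```

Calling sha1_for_tree("/") on a whole file system builds a large Hash; printing each digest as it is computed, as in the answer, keeps memory use flat.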
Why make it difficult by putting this in a one-liner?
If you put your code in a script like this, everything runs smoothly on my system and every file on my HD is read.
On a data disk you'd better find a way to handle large files, like the solution at https://www.ruby-forum.com/topic/58563, which I adapted for SHA1.
require 'digest/sha1'
require 'find'
Find.find("/") do |file|
next unless File.file?(file)
begin
sha = File.open(file, 'rb') do |io|
dig = Digest::SHA1.new
buf = ""
dig.update(buf) while io.read(4096, buf)
dig
end
puts "#{sha} #{file}"
rescue => e
puts e.backtrace
end
end
gives
ba4aeced8ab461b75ff87d989ff16ca2464ea787 /$AVG/$VAULT/vault.db
31d8730390451d236b80c4351b6b287d6853570c /$AVG/$VAULT/vvfolder.idx
b4c783e3478e5b6f795e92d3cf5d85837fffd128 /$Recycle.Bin/S-1-5-21-50811273-296787125-2640436092-1000/desktop.ini
b4c783e3478e5b6f795e92d3cf5d85837fffd128 /$Recycle.Bin/S-1-5-21-50811273-296787125-2640436092-1011/desktop.ini
3109805dcc447395f58fec8b5e8a8fca1d20892b /.rnd
61fc34796b7cc67caf9da685e59461c9d13fba29 /4nt500/4NT.INI
...

Cannot load data file in Sinatra

I created the following parser:
require "./artist"
require "./song"
require "./genre"
require "debugger"
class Parser
attr_accessor :artists, :genres, :song
attr_reader :mp3
REGEX = /(?<artist>.*)\s\-\s(?<song>.*)\s\[(?<genre>.*)\]/
def initialize(directory="data")
debugger
@mp3 = Dir.entries(directory).select {|f| !File.directory? f}
debugger
end
def parse
@mp3.map do |file|
match = REGEX.match(file)
artist = Artist.find_by_name(match[:artist]) || Artist.new.tap {|artist| artist.name = match[:artist]}
song = Song.new
song.name = match[:song]
song.genre = Genre.find_by_name(match[:genre]) || Genre.new.tap {|genre| genre.name = match[:genre]}
#debugger
artist.add_song(song)
end
end
end
a = Parser.new.parse
I tried running it by calling parser.rb in the directory, lib, where it is located. I get the following error messages:
Parser.rb:47:in `open': No such file or directory - data (Errno::ENOENT)
from parser.rb:47:in `entries'
from parser.rb:47:in `initialize'
from parser.rb:68:in `new'
from parser.rb:68:in `<main>'
This is my file structure:
Can anyone please tell me why it cannot recognize my data directory? I have been staring at this for a while now and cannot figure it out. It was working about 10 minutes ago and I cannot remember what I changed to break it.
Appreciate your feedback! Thanks
You should be able to run your example with ruby -Ilib lib/parser.rb from the directory above lib. The -I option adds a directory to the load path, so that the Ruby interpreter can find the other required Ruby files such as lib/song.rb.
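A relative path such as "data" is resolved against the current working directory, not against the location of parser.rb, which would explain why the directory is no longer found when the script is started from inside lib. One way around that is sketched below. The layout (data sitting next to lib) is a guess, since the file structure listing is missing from the question, and list_files and default_data_dir are hypothetical names.

```ruby
# Hypothetical helper: lists the names of the non-directory entries in dir.
# Each entry is joined with dir before the check, because File.directory?
# would otherwise test the bare name against the current working directory.
def list_files(dir)
  Dir.entries(dir).reject { |name| File.directory?(File.join(dir, name)) }.sort
end

# Assumed layout: <project>/lib/parser.rb and <project>/data/.
# __dir__ is the directory of the current source file (Ruby 2.0+);
# the fallback covers contexts where it is nil.
def default_data_dir
  File.expand_path("../data", __dir__ || Dir.pwd)
end
```

With this, Parser#initialize could default to default_data_dir instead of "data", and the script would behave the same regardless of the directory it is launched from.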

Why is Mechanize returning "undefined method 'value=' for nil:NilClass" when trying to set a password?

I wrote a script with Mechanize to scrape some links, which later I will write code to put into an Excel file.
For now I can't authenticate past the first page. I keep getting an undefined method value= for nil:NilClass when attempting to set the password in the form and haven't been able to find any information on it.
I don't even have the method value= in my code so I don't understand what is going on. The code runs fine for the username, but once I enter the password and hit enter I get the error:
users.rb:11:in `block (2 levels) in <main>': undefined method `value=' for nil:NilClass (NoMethodError)
from (eval):23:in `form_with'
from formity_users.rb:7:in `block in <main>'
from /home/codelitt/.rvm/gems/ruby-2.0.0-p247/gems/mechanize-2.7.1/lib/mechanize.rb:433:in `get'
from formity_users.rb:5:in `<main>'
This is my users.rb script:
require 'rubygems'
require 'mechanize'
a = Mechanize.new
a.get('https://www.example.com') do |page|
#Enter information into forms
logged_in = page.form_with(:id => 'frmLogin') do |f|
puts "Username?"
f.field_with(:name => "LoginCommand.EmailAddress").value = gets.chomp
puts "Password?"
f.field_with(:name => "Login.Password").value = gets.chomp
end.click_button
#Click drop down
admin_page = logged_in.click.link_with(:text => /Admin/)
#Click Users and enter user admin section
user_admin = admin_page.click.link_with(:text => /Users/)
#Scrape and print links for now
user_admin.links.each do |link|
text = link.text.strip
next unless text.length > 0
puts text
end
end
I think your error is coming from
f.field_with(:name => "Login.Password")
which seems to be nil: field_with returns nil when no field matches. For the username you specified the input name LoginCommand.EmailAddress, but for the password the input name is Login.Password.
I'd expect whoever wrote this markup to use consistent names. Maybe you should look at the underlying HTML to check that you're using the correct field names in your code.
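To make a wrong field name fail with a more useful message, the lookup can be wrapped in a small guard. This is only a sketch: set_field! is a hypothetical helper, and form is assumed to behave like a Mechanize::Form, i.e. to respond to field_with(:name => ...) and fields.

```ruby
# Hypothetical helper: sets a form field by name, or raises an error that
# lists the field names the form actually contains.
def set_field!(form, name, value)
  field = form.field_with(:name => name)
  if field.nil?
    names = form.fields.map(&:name)
    raise ArgumentError, "no field named #{name.inspect}; form has #{names.inspect}"
  end
  field.value = value
end
```

In the script above, set_field!(f, "Login.Password", gets.chomp) would then report the real field names instead of failing with undefined method `value=' for nil.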

How to scrape data from list of URLs and save data to CSV with nokogiri

I have a file called bontyurls.csv that looks like this:
http://bontrager.com/model/11383
http://bontrager.com/model/01740
http://bontrager.com/model/09595
I want my script to read that file and then spit out a file like this: bonty_test_urls_results.csv
url,model_names
http://bontrager.com/model/11383,"Road TLR Conversion Kit"
http://bontrager.com/model/01740,"404 File Not Found"
http://bontrager.com/model/09595,"RXL Road"
Here's what I've got so far:
# based on code from here: http://www.andrewsturges.com/2011/09/how-to-harvest-web-data-using-ruby-and.html
require 'nokogiri'
require 'open-uri'
require 'csv'
@urls = Array.new
@model_names = Array.new
urls = CSV.read("bontyurls.csv")
(0..urls.length - 1).each do |index|
puts urls[index][0]
doc = Nokogiri::HTML(open(urls[index][0]))
doc.xpath('//h1').each do |model_name|
@model_name << model_name.content
end
end
# write results to file
CSV.open("bonty_test_urls_results.csv", "wb") do |row|
row << ["url", "model_names"]
(0..@urls.length - 1).each do |index|
row << [
@urls[index],
@model_names[index]]
end
end
That code isn't working. I'm getting this error:
$ ruby bonty_test_urls.rb
http://bontrager.com/model/00310
bonty_test_urls.rb:15:in `block (2 levels) in <main>': undefined method `<<' for nil:NilClass (NoMethodError)
from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239:in `block in each'
from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `upto'
from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `each'
from bonty_test_urls.rb:14:in `block in <main>'
from bonty_test_urls.rb:11:in `each'
from bonty_test_urls.rb:11:in `<main>'
Here is some code that returns the model_name at least. I'm just having trouble getting it to work in the larger script:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://bontrager.com/model/09124"))
doc.xpath('//h1').each do |node|
puts node.text
end
Also, I haven't figured out how to handle the URLs that return a 404.
This is how I'd do it:
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
:write_headers => true,
:headers => %w[url model_names]
}
CSV.open('bonty_test_urls_results.csv', 'wb', CSV_OPTIONS) do |csv|
csv_doc = File.foreach('bontyurls.csv') do |url|
url.chomp!
begin
doc = Nokogiri.HTML(open(url))
h1 = doc.at('h1').text.strip
h1 = doc.at('title').text.strip.sub(/^Bontrager: /i, '') if (h1.empty?)
csv << [url, h1]
rescue OpenURI::HTTPError => e
csv << [url, e.message]
end
end
end
Which generates a CSV file like:
url,model_names
http://bontrager.com/model/11383,Road TLR Conversion Kit (Model #11383)
http://bontrager.com/model/01740,404 File Not Found
http://bontrager.com/model/09595,RXL Road (Model #09595)
You declare @model_names, but try to push in to @model_name, which is why it's nil.
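An instance variable that was never assigned evaluates to nil instead of raising, so a typo in its name only surfaces when a method is called on it. Collecting [url, name] pairs in a single array avoids keeping two parallel arrays in sync. A minimal sketch of the writing step, where rows_to_csv is a hypothetical helper that returns the CSV as a String:

```ruby
require 'csv'

# Hypothetical helper: turns [[url, model_name], ...] into CSV text
# with a header row. Values containing commas are quoted by CSV itself.
def rows_to_csv(rows)
  CSV.generate do |csv|
    csv << %w[url model_names]
    rows.each { |url, name| csv << [url, name] }
  end
end
```

The result can be written out with File.write("bonty_test_urls_results.csv", rows_to_csv(rows)). Note that on Ruby 3 an options Hash such as CSV_OPTIONS has to be passed to CSV.open as keywords (**CSV_OPTIONS) rather than as a positional argument.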

Ruby 1.9.2 - Read and parse a remote CSV

I am looking for a way to read and parse locally a remote CSV (hosted on a particular website).
I found on the Internet a couple of interesting examples that make use of FasterCSV, that in ruby 1.9.2 has been merged into CSV. I found that you can read a remote CSV using the gems 'csv' and 'open-uri' this way:
require 'csv'
require 'open-uri'
def read(url)
open(url) do |f|
f.each_line do |l|
CSV.parse(l) do |row|
puts row
end
end
end
end
But when I call this function, I get an exception:
ERROR IOError: closed stream
Can anyone explain why? Is there anything wrong with this code? Should I choose another approach for reading remote CSV files?
Update
The best solution I've found till now is this:
def read(url)
data = []
begin
open(url) do |f|
data = CSV.parse f
end
rescue IOError => e
# Silently catch the exception ...
end
return data
end
but it doesn't seem very clean. I really do not like silently catching an exception that shouldn't be raised in the first place ...
Update 2
I can reproduce the error using both
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
and
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.0]
This is the code from my test.rb file:
require 'rubygems'
require 'open-uri'
require 'csv'
def read(url)
data = []
begin
open(url) do |f|
data = CSV.parse f
end
end
puts data
end
read("http://www.euribor-ebf.eu/assets/modules/rateisblue/processed_files/myav_EURIBOR_2011.csv")
And this is the output of the ruby test.rb command
/Users/marzu/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/open-uri.rb:152:in `close': closed stream (IOError)
from /Users/marzu/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/open-uri.rb:152:in `open_uri'
from /Users/marzu/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/open-uri.rb:671:in `open'
from /Users/marzu/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/open-uri.rb:33:in `open'
from test.rb:8:in `read'
from test.rb:16:in `<main>'
I am using rvm 1.6.9 on Mac OS X 10.6.7.
Any suggestions?
On Mac OS X 10.6.7, using Ruby 1.9.2, I get the same error as displayed above. But using the following code to read CSV files works for the example URL provided:
require 'rubygems'
require 'open-uri'
require 'csv'
def read(url)
CSV.new(open(url), :headers => :first_row).each do |line|
puts line
puts line[0]
puts line['FEB11']
end
end
read("http://www.euribor-ebf.eu/assets/modules/rateisblue/processed_files/myav_EURIBOR_2011.csv")
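Another option, sketched here under assumptions, is to read the whole response into a String first and only then hand it to CSV.parse, so that the stream opened by open-uri is never passed to the CSV library. Whether that is what triggers the closed stream error is not confirmed in this thread. The method names are hypothetical, and URI.open is the open-uri entry point on newer Ruby versions (Ruby 3.0 no longer opens URLs through Kernel#open).

```ruby
require 'csv'
require 'open-uri'

# Parses CSV text whose first line is a header row.
# Returns an Array of Hashes keyed by column name.
def parse_csv_text(text)
  CSV.parse(text, headers: true).map(&:to_h)
end

# Downloads url and parses the body. The block form of URI.open returns
# the block's value (the body) and closes the stream itself.
def read_remote_csv(url)
  parse_csv_text(URI.open(url, &:read))
end
```

Keeping the parsing in parse_csv_text makes it possible to check that step against a literal String without network access.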

Resources