Creating a file for each URL of a site using Ruby

I have the following code, which creates a file with the content from a crawled site:
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
  anemone.on_pages_like(/http:\/\/www.findbrowsenodes.com\/us\/.+\/[\d]*/) do |page|
    doc = Nokogiri::HTML(open(page.url))
    node_id = doc.at_css("#n_info #clipnode").text unless doc.at_css("#n_info #clipnode").nil?
    node_name = doc.at_css("#n_info .node_name").text unless doc.at_css("#n_info .node_name").nil?
    node_url = page.url
    open("filename.txt", "a") do |f|
      f.puts "#{node_id}\t#{node_name}\t#{node_url}"
    end
  end
end
Now I want to create not just one file but several, each named after its node_id. I tried this:
page.each do |p|
  p.open("#{node_id}.txt", "a") do |f|
    f.puts "#{node_id}\t#{node_name}\t#{node_url}"
  end
end
but got this:
undefined method `value' for #<Nokogiri::XML::DTD:0x51c089a name="html"> (NoMethodError)
then tried this:
page.open("#{node_id}.txt", "a") do |f|
  f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
but got this:
private method `open' called for #<Anemone::Page:0x91472e8> (NoMethodError)
What's the right way of doing this?

File.open("#{node_id}.txt", "w") do |f|
  f.puts "stuff"
end
How you make the assignment to node_id is up to you.
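For instance, dropped straight into the crawl loop from the question (same selectors and URL pattern as above, so treat this as a sketch rather than tested code):

require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
  anemone.on_pages_like(/http:\/\/www.findbrowsenodes.com\/us\/.+\/[\d]*/) do |page|
    doc = Nokogiri::HTML(open(page.url))

    node_id   = doc.at_css("#n_info #clipnode") && doc.at_css("#n_info #clipnode").text
    node_name = doc.at_css("#n_info .node_name") && doc.at_css("#n_info .node_name").text
    node_url  = page.url

    # Skip pages where no node id was found, otherwise they would all
    # end up appended to a single ".txt" file.
    next if node_id.nil? || node_id.strip.empty?

    # One file per node_id; mode "a" appends if the file already exists,
    # "w" would overwrite it on every visit instead.
    File.open("#{node_id}.txt", "a") do |f|
      f.puts "#{node_id}\t#{node_name}\t#{node_url}"
    end
  end
end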

Related

Getting all unique URLs using Nokogiri

I've been working for a while to try to use the .uniq method to generate a unique list of URLs from a website (within the /informatics path). No matter what I try, I get a method error when trying to generate the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.
Once I get the list I'm going to need to store it in a database via ActiveRecord, but I need the unique list before I can start to wrap my head around that.
require 'nokogiri'
require 'open-uri'
require 'active_record'

ARGV[0] = "https://www.nku.edu/academics/informatics.html"

ARGV.each do |arg|
  open(arg) do |f|
    # Display connection data
    puts "#" * 25 + "\nConnection: '#{arg}'\n" + "#" * 25
    [:base_uri, :meta, :status, :charset, :content_encoding,
     :content_type, :last_modified].each do |method|
      puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
    end

    # Display the href links
    base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
    puts "base_url: #{base_url}"

    Nokogiri::HTML(f).css('a').each do |anchor|
      href = anchor['href']
      # Make unique
      if href =~ /.*informatics/
        puts href
        # store stuff to ActiveRecord
      end
    end
  end
end
Replace the Nokogiri::HTML part so that it selects only the anchors whose href attribute matches /informatics/; the result is already an array, so you can then call uniq on it:
require 'nokogiri'
require 'open-uri'
require 'active_record'

ARGV[0] = 'https://www.nku.edu/academics/informatics.html'

ARGV.each do |arg|
  open(arg) do |f|
    puts "#{'#' * 25} \nConnection: '#{arg}'\n #{'#' * 25}"
    %i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
      puts "#{method}: #{f.send(method)}" if f.respond_to? method
    end
    puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"

    anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
    puts anchors.map { |anchor| anchor['href'] }.uniq
  end
end
Running that prints the de-duplicated list of matching hrefs.
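If you want the unique list in a variable (for example, to hand to ActiveRecord later, as the question mentions) rather than just printed, a minimal sketch along the same lines, assuming the same page and the same /informatics/ filter:

require 'nokogiri'
require 'open-uri'

url = 'https://www.nku.edu/academics/informatics.html'
doc = Nokogiri::HTML(open(url))

unique_hrefs = doc.css('a')
                  .map { |anchor| anchor['href'] }          # pull out the href strings
                  .compact                                  # drop anchors without an href
                  .select { |href| href =~ /informatics/ }  # same filter as above
                  .uniq

puts unique_hrefs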

Ruby - Getting page content even if it doesn't exist

I am trying to put together a series of custom 404 pages.
require 'uri'

def open(url)
  page_content = Net::HTTP.get(URI.parse(url))
  puts page_content.content
end

open('http://somesite.com/1ygjah1761')
This code exits the program with an error. How can I get the page content from a website, regardless of whether it is a 404 or not?
You need to rescue from the error
def open(url)
  require 'net/http'
  page_content = ""
  begin
    page_content = Net::HTTP.get(URI.parse(url))
    puts page_content
  rescue Net::HTTPNotFound
    puts "THIS IS 404" + page_content
  end
end
You can find more information on something like this here: http://tammersaleh.com/posts/rescuing-net-http-exceptions/
Net::HTTP.get returns the page content directly as a string, so there is no need to call .content on the results:
page_content = Net::HTTP.get(URI.parse(url))
puts page_content
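If you also want to know whether the response was a 404 while still getting the body, Net::HTTP.get_response returns a response object carrying both the status and the content. A rough sketch, using a hypothetical fetch helper rather than redefining open:

require 'net/http'
require 'uri'

def fetch(url)
  response = Net::HTTP.get_response(URI.parse(url))
  puts "THIS IS 404" if response.is_a?(Net::HTTPNotFound)
  response.body  # the page content, even when the server replies with a 404
end

puts fetch('http://somesite.com/1ygjah1761')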

How to scrape data from a list of URLs and save it to CSV with Nokogiri

I have a file called bontyurls.csv that looks like this:
http://bontrager.com/model/11383
http://bontrager.com/model/01740
http://bontrager.com/model/09595
I want my script to read that file and then spit out a file like this: bonty_test_urls_results.csv
url,model_names
http://bontrager.com/model/11383,"Road TLR Conversion Kit"
http://bontrager.com/model/01740,"404 File Not Found"
http://bontrager.com/model/09595,"RXL Road"
Here's what I've got so far:
# based on code from here: http://www.andrewsturges.com/2011/09/how-to-harvest-web-data-using-ruby-and.html
require 'nokogiri'
require 'open-uri'
require 'csv'

@urls = Array.new
@model_names = Array.new

urls = CSV.read("bontyurls.csv")

(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
  doc.xpath('//h1').each do |model_name|
    @model_name << model_name.content
  end
end

# write results to file
CSV.open("bonty_test_urls_results.csv", "wb") do |row|
  row << ["url", "model_names"]
  (0..@urls.length - 1).each do |index|
    row << [
      @urls[index],
      @model_names[index]]
  end
end
That code isn't working. I'm getting this error:
$ ruby bonty_test_urls.rb
http://bontrager.com/model/00310
bonty_test_urls.rb:15:in `block (2 levels) in <main>': undefined method `<<' for nil:NilClass (NoMethodError)
from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239:in `block in each'
from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `upto'
from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `each'
from bonty_test_urls.rb:14:in `block in <main>'
from bonty_test_urls.rb:11:in `each'
from bonty_test_urls.rb:11:in `<main>'
Here is some code that returns the model_name at least. I'm just having trouble getting it to work in the larger script:
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://bontrager.com/model/09124"))
doc.xpath('//h1').each do |node|
  puts node.text
end
Also, I haven't figured out how to handle the URLs that return a 404.
This is how I'd do it:
require 'csv'
require 'nokogiri'
require 'open-uri'

CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url model_names]
}

CSV.open('bonty_test_urls_results.csv', 'wb', CSV_OPTIONS) do |csv|
  File.foreach('bontyurls.csv') do |url|
    url.chomp!
    begin
      doc = Nokogiri.HTML(open(url))
      h1 = doc.at('h1').text.strip
      h1 = doc.at('title').text.strip.sub(/^Bontrager: /i, '') if h1.empty?
      csv << [url, h1]
    rescue OpenURI::HTTPError => e
      csv << [url, e.message]
    end
  end
end
Which generates a CSV file like:
url,model_names
http://bontrager.com/model/11383,Road TLR Conversion Kit (Model #11383)
http://bontrager.com/model/01740,404 File Not Found
http://bontrager.com/model/09595,RXL Road (Model #09595)
You declare @model_names, but try to push into @model_name, which is why it's nil.
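For reference, once that typo is fixed and @urls is actually populated, the index-based approach from the question also works; a rough sketch under those assumptions (404 handling omitted, since the rescue in the answer above already covers it):

require 'nokogiri'
require 'open-uri'
require 'csv'

@urls = []
@model_names = []

CSV.read("bontyurls.csv").each do |row|
  url = row[0]
  @urls << url
  doc = Nokogiri::HTML(open(url))
  h1 = doc.at_xpath('//h1')                      # first h1, may be nil
  @model_names << (h1 ? h1.content.strip : "")   # note: @model_names, not @model_name
end

CSV.open("bonty_test_urls_results.csv", "wb") do |csv|
  csv << ["url", "model_names"]
  @urls.each_with_index do |url, index|
    csv << [url, @model_names[index]]
  end
end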

Ruby scraper. How to export to CSV?

I wrote this Ruby script to scrape product info from the manufacturer's website. The scraping and storage of the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is being thrown:
scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError)
I do not understand this piece of code. What is it doing, and why isn't it working right?
send_data csv_data,
:type => 'text/csv; charset=iso-8859-1; header=present',
:disposition => "attachment; filename=products.csv"
Full code:
#!/usr/bin/ruby
require 'rubygems'
require 'anemone'
require 'fastercsv'

productsArray = Array.new

class Product
  attr_accessor :name, :sku, :desc
end

# Scraper Code
Anemone.crawl("http://retail.pelicanbayltd.com/") do |anemone|
  anemone.on_every_page do |page|
    currentPage = Product.new
    # Product info parsing
    currentPage.name = page.doc.css(".page_headers").text
    currentPage.sku = page.doc.css("tr:nth-child(2) strong").text
    currentPage.desc = page.doc.css("tr:nth-child(4) .item").text
    if currentPage.sku =~ /#\d\d\d\d/
      currentPage.sku = currentPage.sku[1..-1]
      productsArray.push(currentPage)
    end
  end
end
# CSV Export Code
products = productsArray.find(:all)
csv_data = FasterCSV.generate do |csv|
  # header row
  csv << ["sku", "name", "desc"]
  # data rows
  productsArray.each do |product|
    csv << [product.sku, product.name, product.desc]
  end
end

send_data csv_data,
          :type => 'text/csv; charset=iso-8859-1; header=present',
          :disposition => "attachment; filename=products.csv"
send_data is a Rails controller method for streaming a response to the browser, which is why a plain Ruby script raises NoMethodError when it reaches that line. If you are new to Ruby, you should be using Ruby 1.9 or later, in which case you can use the built-in CSV library, which incorporates FasterCSV plus i18n support:
require 'csv'

CSV.open('filename.csv', 'w') do |csv|
  csv << [sku, name, desc]
end
http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html
Alternatively, just write the csv_data string you already generated to a file:
File.open('filename.csv', 'w') do |f|
  f.write(csv_data)
end
It probably makes more sense to do:
@csv = FasterCSV.open('filename.csv', 'w')
and then write to it as you go along:
@csv << [sku, name, desc]
That way, if your script crashes halfway through, you've at least got half of the data.
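A rough sketch of that write-as-you-go idea applied to the crawler from the question, using the stdlib CSV class (a FasterCSV.open call would look the same); the CSS selectors and the SKU filter are taken from the question as-is:

require 'rubygems'
require 'anemone'
require 'csv'

CSV.open('products.csv', 'w') do |csv|
  csv << ["sku", "name", "desc"]  # header row written first

  Anemone.crawl("http://retail.pelicanbayltd.com/") do |anemone|
    anemone.on_every_page do |page|
      name = page.doc.css(".page_headers").text
      sku  = page.doc.css("tr:nth-child(2) strong").text
      desc = page.doc.css("tr:nth-child(4) .item").text

      # Same filter as the original script: only keep SKUs that look like "#1234",
      # stripping the leading "#" before the row is written.
      csv << [sku[1..-1], name, desc] if sku =~ /#\d\d\d\d/
    end
  end
end

Each row is written as soon as its page has been scraped, so a crash partway through the crawl still leaves the earlier rows in the file.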

Save web-scraped data

I am trying to scrape a website. I am able to scrape data from that website, but I am having trouble saving the data from the scrape to the YAML file I have included.
My Code:
require 'rubygems'
require 'open-uri'
require 'hpricot'

article = []

doc = open("http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"{|f| Hpricot(f) }

(doc/"/html/body/div/div/div/div/table/").each do |article|
  puts "#{article.inner_html}"
end

File.open('test.yaml', 'w') { |f|
  f << article.to_yaml
}
First, you are missing a closing parenthesis for the open call; it needs a ) right before the block starts.
When you add that, you'll notice that you get a NoMethodError (undefined method 'to_yaml' for []:Array). To fix that you have to require 'yaml', which pulls in the monkey-patches for the Array class. After that you'll notice that your YAML file is empty, because you never put anything into article. Here's a fixed version:
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'yaml'

articles = []
url = "http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"
doc = open(url) { |f| Hpricot(f) }

(doc/"/html/body/div/div/div/div/table/").each do |article|
  articles << article.inner_html
end

File.open('test.yaml', 'w') { |f| f << articles.to_yaml }
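To check the result, the file can be read back with YAML.load_file; a quick sketch:

require 'yaml'

tables = YAML.load_file('test.yaml')  # => Array of the saved inner_html strings
puts tables.size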
