Getting all unique URLs using Nokogiri - ruby

I've been trying for a while to use the .uniq method to generate a unique list of URLs from a website (within the /informatics path). Whatever I try, I get a method error when generating the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.
Once I have the list I'll need to store it in a database via ActiveRecord, but I need the unique list before I start to wrap my head around that.
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0]="https://www.nku.edu/academics/informatics.html"
ARGV.each do |arg|
open(arg) do |f|
# Display connection data
puts "#"*25 + "\nConnection: '#{arg}'\n" + "#"*25
[:base_uri, :meta, :status, :charset, :content_encoding,
:content_type, :last_modified].each do |method|
puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
end
# Display the href links
base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
puts "base_url: #{base_url}"
Nokogiri::HTML(f).css('a').each do |anchor|
href = anchor['href']
# Make Unique
if href =~ /.*informatics/
puts href
#store stuff to active record
end
end
end
end

Replace the Nokogiri::HTML part so that it selects only the anchors whose href attribute matches /.*informatics/. The result of select is already an array, so you can map it to the href values and call uniq on that:
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0] = 'https://www.nku.edu/academics/informatics.html'
ARGV.each do |arg|
open(arg) do |f|
puts "#{'#' * 25} \nConnection: '#{arg}'\n #{'#' * 25}"
%i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
end
puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"
anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
puts anchors.map { |anchor| anchor['href'] }.uniq
end
end
See output.
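The deduplication step can also be tried on its own, without fetching the page. The sketch below is only an illustration under a few assumptions: the helper names absolute_url and unique_informatics_urls are hypothetical, the href values are made up, and relative links are resolved against the page URL with URI.join before uniq is applied (so that "/x.html" and its absolute form count as the same link). In the script above, the hrefs would come from the anchors rather than a literal array.

```ruby
require 'uri'

# Hypothetical helper: turns a possibly relative href into an absolute URL,
# or returns nil when the href is not a valid URI.
def absolute_url(page_url, href)
  URI.join(page_url, href).to_s
rescue URI::Error
  nil
end

# Hypothetical helper: resolves every href, keeps the ones that mention
# "informatics", and drops duplicates while preserving first-seen order.
def unique_informatics_urls(hrefs, page_url)
  hrefs.compact
       .map { |href| absolute_url(page_url, href) }
       .compact
       .select { |url| url.include?('informatics') }
       .uniq
end

# Made-up href values standing in for anchor['href'] results (nil = no href).
hrefs = [
  '/academics/informatics.html',
  '/academics/informatics.html',
  'informatics/about.html',
  '/admissions.html',
  nil
]
puts unique_informatics_urls(hrefs, 'https://www.nku.edu/academics/informatics.html')
```

Resolving before calling uniq is a design choice, not something the original answer does; if the raw href strings are what you want to store, dropping the map step leaves the rest unchanged.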

Related

undefined method each for nil class when using Nokogiri

I am trying to fetch all the links on a given page, but I get the error undefined method `each' for nil:NilClass
require 'nokogiri'
def find_links(link)
page = Nokogiri::HTML(open(link))
link_size = page.css('li')
(0..link_size.length).each do |index|
b = link_size[index]['href']
return b
end
end
find_links('http://code.tutsplus.com/tutorials/you-dont-know-anything-about-regular-expressions-a-complete-guide--net-7869').each do |url|
puts url
end
There are a couple of issues in your code. See the inline explanations below:
def find_links(link)
page = Nokogiri::HTML(open(link))
link_size = page.css('li')
(0..link_size.length).each do |index|
b = link_size[index]['href'] # <li> elements do not carry an 'href' attribute, so this is nil for a list item.
return b # `return` inside the block exits find_links itself on the first iteration, so you never collect all the links.
# That is the source of the error: find_links returns nil, and the caller then invokes `each` on that nil.
end
end
Change your code to the following (explanations inline):
require 'nokogiri'
require 'open-uri'
def find_links(link)
page = Nokogiri::HTML(open(link))
# You want to iterate on anchor tags rather than list.
# Note the use of `map`: it returns an array, and because it is the last expression in the method, that array of links is the method's return value.
# .css('li a') returns every anchor tag that has a list item somewhere in its parent chain.
page.css('li a').map { |x| x['href'] }
end
find_links('http://code.tutsplus.com/tutorials/you-dont-know-anything-about-regular-expressions-a-complete-guide--net-7869').each do |url|
puts url
end
require 'nokogiri'
require 'open-uri'
def find_links(link)
page = Nokogiri::HTML(open(link))
# Walk the anchors directly. Indexing the 'li a' list by the number of 'li' elements skips the first link and can run past the end.
page.css('li a').each do |anchor|
puts anchor['href']
end
end
find_links('http://code.tutsplus.com/tutorials/you-dont-know-anything-about-regular-expressions-a-complete-guide--net-7869')
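The difference between returning from inside the loop and collecting with `map`, as described in the first answer, can be reproduced without Nokogiri or network access. In this sketch the hashes are hypothetical stand-ins for parsed elements (a hash lookup plays the role of reading an attribute), and the method names are made up for the illustration.

```ruby
# Hypothetical stand-ins for parsed nodes: the first one has no 'href',
# just like a <li> element.
ITEMS = [{ 'class' => 'nav' }, { 'href' => '/a' }, { 'href' => '/b' }].freeze

# Mirrors the original bug: `return` inside the block leaves the whole
# method during the first iteration.
def first_href_only(items)
  items.each do |item|
    return item['href']
  end
end

# Mirrors the fix: `map` builds an array and the method returns it.
def all_hrefs(items)
  items.map { |item| item['href'] }.compact
end

p first_href_only(ITEMS)
p all_hrefs(ITEMS)
```

The first call yields nil because the first item has no 'href', which is exactly the value the caller in the question then tries to call `each` on.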

Ruby - Getting page content even if it doesn't exist

I am trying to put together a series of custom 404 pages.
require 'uri'
def open(url)
page_content = Net::HTTP.get(URI.parse(url))
puts page_content.content
end
open('http://somesite.com/1ygjah1761')
The code above exits the program with an error. How can I get the page content from a website, regardless of whether it returns a 404 or not?
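A 404 does not raise an error here: Net::HTTP.get returns the body whatever the status, and Net::HTTPNotFound is a response class rather than an exception, so a rescue clause naming it never matches. Fetch the response object and check its class instead, and rescue only genuine network failures:
require 'net/http'
require 'uri'
def open(url)
response = Net::HTTP.get_response(URI.parse(url))
# A 404 still comes back as a normal response object with a body.
puts "THIS IS 404" if response.is_a?(Net::HTTPNotFound)
puts response.body
rescue SocketError, Timeout::Error, SystemCallError => e
puts "Request failed: #{e.message}"
end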
You can find more information on something like this here: http://tammersaleh.com/posts/rescuing-net-http-exceptions/
Net::HTTP.get returns the page content directly as a string, so there is no need to call .content on the results:
page_content = Net::HTTP.get(URI.parse(url))
puts page_content

Creating a file for each URL of a site using Ruby

I have the following code, which creates a file with the content from a crawled site:
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'
Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
anemone.on_pages_like(/http:\/\/www.findbrowsenodes.com\/us\/.+\/[\d]*/) do | page |
doc = Nokogiri::HTML(open(page.url))
node_id = doc.at_css("#n_info #clipnode").text unless doc.at_css("#n_info #clipnode").nil?
node_name = doc.at_css("#n_info .node_name").text unless doc.at_css("#n_info .node_name").nil?
node_url = page.url
open("filename.txt", "a") do |f|
f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
end
end
Now I want to create not one file but a separate file for each page, named after its node_id. I tried this:
page.each do |p|
p.open("#{node_id}.txt", "a") do |f|
f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
end
but got this:
undefined method `value' for #<Nokogiri::XML::DTD:0x51c089a name="html"> (NoMethodError)
then tried this:
page.open("#{node_id}.txt", "a") do |f|
f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
but got this:
private method `open' called for #<Anemone::Page:0x91472e8> (NoMethodError)
What's the right way of doing this?
File.open("#{node_id}.txt", "w") do |f|
f.puts "stuff"
end
How you make the assignment to node_id is up to you.
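Put together with the values from the question, the File.open call might look like the sketch below. The helper name write_node_file is hypothetical, the sample values are made up, and the example writes into a temporary directory so it can be run safely. It also skips pages where node_id is nil, which can happen in the question's code because of the `unless ... nil?` guards.

```ruby
require 'tmpdir'

# Hypothetical helper: writes one tab-separated record to "<node_id>.txt"
# inside dir and returns the path, or nil when there is no usable node_id.
def write_node_file(dir, node_id, node_name, node_url)
  return nil if node_id.nil? || node_id.strip.empty?

  path = File.join(dir, "#{node_id}.txt")
  File.open(path, 'w') { |f| f.puts "#{node_id}\t#{node_name}\t#{node_url}" }
  path
end

# Usage with made-up values, written to a temporary directory.
Dir.mktmpdir do |dir|
  path = write_node_file(dir, '1000', 'Books', 'http://www.findbrowsenodes.com/us/Books/1000')
  puts File.read(path)
end
```

Inside the Anemone block, the call would take the place of the existing open("filename.txt", "a") section. The sketch assumes node_id contains only characters that are safe in a file name; that is not verified here.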

ruby method zip trying to get a string of zips I made

I have a method that zips up files I pass in.
require 'zip/zip'
def zipup(aname, aloc="/tmp/")
Zip::ZipFile.open "#{aloc}"+File.basename(aname)+".zip", Zip::ZipFile::CREATE do |zipfile|
zipfile.add File.basename(aname), aname
end
end
I need to get a string object or array object from this method that has the archive.zip name of every file that has been compressed.
rubyzip does have a to_s method, although I have failed to get the syntax right.
http://rubyzip.sourceforge.net/classes/Zip/ZipEntry.html#M000131
Thanks from a new Rubyist.
Welcome, Joey. Do you use the 'zip/zip' gem or just 'zip'? If you require something, include it in the question next time. This gem could use some extra documentation and methods, it seems to me.
This works:
require 'zip' #or 'zip/zip' both work
def zip_list(filename)
zipfile = Zip::ZipFile.open(filename)
list = []
zipfile.each { |entry| list << entry.name }
list
end
puts zip_list("c:/temp/zip1.zip")
Another way:
require 'zip/zip'
Zip::ZipFile.open("c:/temp/zip1.rb.zip") do |zipfile|
zipfile.entries.each do |entry|
puts entry.name
end
end
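The answers above list the entries inside one archive. If what is needed is instead the name of every archive that zipup produced, one option is to keep track of the archive paths themselves. The sketch below leaves rubyzip out entirely and only shows the path bookkeeping; archive_path and archive_paths are hypothetical names, the file names are made up, and the path expression mirrors the one used in zipup.

```ruby
# Hypothetical helper: the archive path zipup would use for one input file.
def archive_path(aname, aloc = '/tmp/')
  File.join(aloc, "#{File.basename(aname)}.zip")
end

# Hypothetical helper: the archive paths for a whole list of input files.
def archive_paths(files, aloc = '/tmp/')
  files.map { |aname| archive_path(aname, aloc) }
end

# Usage with made-up file names.
puts archive_paths(['/home/joey/report.txt', 'notes.rb'])
```

In the real method, returning this path as the last expression of zipup (after the block that adds the file) would let the caller collect the results with map in the same way.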

Save Webscraped data

I am trying to scrape a website. I can scrape the data from the site, but I am having trouble saving the scraped data to the YAML file shown in my code.
My Code:
require 'rubygems'
require 'open-uri'
require 'hpricot'
article = []
doc = open("http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"{|f| Hpricot(f) }
(doc/"/html/body/div/div/div/div/table/").each do |article|
puts "#{article.inner_html}"
end
File.open('test.yaml', 'w') { |f|
f <<article.to_yaml
}
First, you are missing the closing parenthesis of the open call: a `)` belongs right before the block starts.
When you add that you'll notice that you'll get a NoMethodError (undefined method 'to_yaml' for []:Array). To fix that you have to require 'yaml', which pulls in the monkey-patches for the Array class. After that you'll notice that your yaml file is empty, because you never put anything into article. Here's a fixed version:
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'yaml'
articles = []
url = "http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"
doc = open(url) {|f| Hpricot(f) }
(doc/"/html/body/div/div/div/div/table/").each do |article|
articles << article.inner_html
end
File.open('test.yaml', 'w') { |f| f << articles.to_yaml }
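The collect-then-dump step can be checked without Hpricot or the network. This is a minimal sketch using only the standard library: save_articles is a hypothetical helper name, the strings stand in for the inner_html values, and the file is written to a temporary directory and read back to confirm the round trip.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical helper: serializes an array of strings to a YAML file.
def save_articles(path, articles)
  File.write(path, articles.to_yaml)
end

# Usage with made-up fragments standing in for article.inner_html.
Dir.mktmpdir do |dir|
  path = File.join(dir, 'test.yaml')
  save_articles(path, ['<tr><td>IRS</td></tr>', '<tr><td>OTC</td></tr>'])
  puts File.read(path)
  p YAML.load_file(path)
end
```

Reading the file back with YAML.load_file is a quick way to confirm the array was filled before it was written, which was the underlying problem in the question.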
