Anemone: print links on first page - Ruby

I wanted to see what I was doing wrong here. I need to print the links on the parent page, even if they point to another domain, and then stop.
require 'anemone'

url = ARGV[0]

Anemone.crawl(url, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    page.links.each do |link|
      puts link
    end
  end
end
What am I not doing right?
Edit: It outputs nothing.

This worked for me:
require 'anemone'
require 'optparse'

file = ARGV[0]

File.open(file).each do |url|
  url = URI.parse(URI.encode(url.strip))
  Anemone.crawl(url, :discard_page_bodies => true) do |anemone|
    anemone.on_every_page do |page|
      # page.doc is the parsed Nokogiri document, so this also picks up off-domain hrefs
      links = page.doc.xpath("//a/@href")
      unless links.nil?
        links.each do |link|
          puts link.to_s
        end
      end
    end
  end
end
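If all you actually need is every href on a single starting page, external domains included, you can skip the crawl entirely. A minimal sketch using Nokogiri and open-uri directly (no Anemone), assuming the URL is passed as the first argument:
require 'nokogiri'
require 'open-uri'

url = ARGV[0]
doc = Nokogiri::HTML(URI.open(url))   # URI.open needs Ruby 2.5+; on older versions use open(url)
doc.xpath('//a/@href').each { |href| puts href }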

Related

Getting all unique URLs using Nokogiri

I've been working for a while to try to use the .uniq method to generate a unique list of URLs from a website (within the /informatics path). No matter what I try, I get a method error when trying to generate the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.
Once I get the list I'm going to need to store these to a database via ActiveRecord, but I need the unique list before I start to wrap my head around that.
require 'nokogiri'
require 'open-uri'
require 'active_record'

ARGV[0] = "https://www.nku.edu/academics/informatics.html"

ARGV.each do |arg|
  open(arg) do |f|
    # Display connection data
    puts "#" * 25 + "\nConnection: '#{arg}'\n" + "#" * 25
    [:base_uri, :meta, :status, :charset, :content_encoding,
     :content_type, :last_modified].each do |method|
      puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
    end

    # Display the href links
    base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
    puts "base_url: #{base_url}"

    Nokogiri::HTML(f).css('a').each do |anchor|
      href = anchor['href']
      # Make unique
      if href =~ /.*informatics/
        puts href
        # store stuff to ActiveRecord
      end
    end
  end
end
Replace the Nokogiri::HTML part to select only those href attributes that match /.*informatics/, and then you can call uniq, since the result is already an array:
require 'nokogiri'
require 'open-uri'
require 'active_record'

ARGV[0] = 'https://www.nku.edu/academics/informatics.html'

ARGV.each do |arg|
  open(arg) do |f|
    puts "#{'#' * 25}\nConnection: '#{arg}'\n#{'#' * 25}"
    # note: no commas inside %i[] -- each word becomes its own symbol
    %i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
      puts "#{method}: #{f.send(method)}" if f.respond_to? method
    end
    puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"

    # keep only anchors whose href matches /informatics/, then dedupe with uniq
    anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
    puts anchors.map { |anchor| anchor['href'] }.uniq
  end
end
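The question also mentions persisting the list with ActiveRecord. That part is not in the original code, so the following is only a rough sketch under assumed names: a Link model backed by an SQLite links table with a url column, with the sqlite3 gem installed.
require 'active_record'

# Hypothetical setup: an SQLite database and a links table with a url column
ActiveRecord::Base.establish_connection(adapter: 'sqlite3', database: 'links.sqlite3')

unless ActiveRecord::Base.connection.table_exists?(:links)
  ActiveRecord::Base.connection.create_table(:links) { |t| t.string :url }
end

class Link < ActiveRecord::Base; end

# anchors comes from the snippet above
anchors.map { |anchor| anchor['href'] }.uniq.each do |href|
  Link.find_or_create_by(url: href)   # skip URLs that are already stored
end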

Ruby - Getting page content even if it doesn't exist

I am trying to put together a series of custom 404 pages.
require 'uri'

def open(url)
  page_content = Net::HTTP.get(URI.parse(url))
  puts page_content.content
end

open('http://somesite.com/1ygjah1761')
The above code exits the program with an error. How can I get the page content from a website, regardless of whether it returns a 404 or not?
You need to rescue from the error:
def open(url)
  require 'net/http'
  page_content = ""
  begin
    page_content = Net::HTTP.get(URI.parse(url))
    puts page_content
  rescue Net::HTTPNotFound
    puts "THIS IS 404" + page_content
  end
end
You can find more information on something like this here: http://tammersaleh.com/posts/rescuing-net-http-exceptions/
Net::HTTP.get returns the page content directly as a string, so there is no need to call .content on the results:
page_content = Net::HTTP.get(URI.parse(url))
puts page_content
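If you also want to know whether the page really was a 404, Net::HTTP.get_response gives you a response object, so you can check the status and still read the body. A small sketch (the fetch_page helper name is just for illustration):
require 'net/http'
require 'uri'

def fetch_page(url)
  response = Net::HTTP.get_response(URI.parse(url))
  puts "THIS IS 404" if response.is_a?(Net::HTTPNotFound)
  puts response.body   # the body is available no matter what the status was
end

fetch_page('http://somesite.com/1ygjah1761')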

Creating a file for each URL of a site using Ruby

I have the following code, which creates a file with the content from a crawled site:
require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
  anemone.on_pages_like(/http:\/\/www.findbrowsenodes.com\/us\/.+\/[\d]*/) do |page|
    doc = Nokogiri::HTML(open(page.url))
    node_id = doc.at_css("#n_info #clipnode").text unless doc.at_css("#n_info #clipnode").nil?
    node_name = doc.at_css("#n_info .node_name").text unless doc.at_css("#n_info .node_name").nil?
    node_url = page.url

    open("filename.txt", "a") do |f|
      f.puts "#{node_id}\t#{node_name}\t#{node_url}"
    end
  end
end
Now I want to create not one but several files, each named after its node_id. I tried this:
page.each do |p|
  p.open("#{node_id}.txt", "a") do |f|
    f.puts "#{node_id}\t#{node_name}\t#{node_url}"
  end
end
but got this:
undefined method `value' for #<Nokogiri::XML::DTD:0x51c089a name="html"> (NoMethodError)
then tried this:
page.open("#{node_id}.txt", "a") do |f|
  f.puts "#{node_id}\t#{node_name}\t#{node_url}"
end
but got this:
private method `open' called for #<Anemone::Page:0x91472e8> (NoMethodError)
What's the right way of doing this?
File.open("#{node_id}.txt", "w") do |f|
  f.puts "stuff"
end
How you make the assignment to node_id is up to you.
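Put back into the crawl from the question, that looks roughly like this; the selectors and URL are the ones from the original code, and the guard that skips pages without a node_id is an assumption:
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.findbrowsenodes.com/", :delay => 3) do |anemone|
  anemone.on_pages_like(/http:\/\/www.findbrowsenodes.com\/us\/.+\/[\d]*/) do |page|
    doc = Nokogiri::HTML(open(page.url))
    node_id   = doc.at_css("#n_info #clipnode").text unless doc.at_css("#n_info #clipnode").nil?
    node_name = doc.at_css("#n_info .node_name").text unless doc.at_css("#n_info .node_name").nil?
    next if node_id.nil?   # skip pages with no id rather than writing to ".txt"

    # one file per node_id instead of a single filename.txt
    File.open("#{node_id}.txt", "w") do |f|
      f.puts "#{node_id}\t#{node_name}\t#{page.url}"
    end
  end
end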

Download a PDF file with a web crawler

I'm beginning to use the Ruby programming language. I have a Ruby script that crawls PDF files on a page with Anemone:
Anemone.crawl("http://example.com") do |anemone|
  anemone.on_pages_like(/\b.+.pdf/) do |page|
    puts page.url
  end
end
I want to download page.url using a Ruby gem. Which gem can I use to download page.url?
No need for an extra gem; try this:
require 'anemone'

Anemone.crawl("http://www.rubyinside.com/media/", :depth_limit => 1, :obey_robots_txt => true, :skip_query_strings => true) do |anemone|
  anemone.on_pages_like(/\b.+.pdf/) do |page|
    begin
      filename = File.basename(page.url.request_uri.to_s)
      File.open(filename, "wb") { |f| f.write(page.body) }
      puts "downloaded #{page.url}"
    rescue
      puts "error while downloading #{page.url}"
    end
  end
end
gives
downloaded http://www.rubyinside.com/media/poignant-guide.pdf
and the PDF is fine.
If you're on a UNIX system, maybe UnixUtils:
require 'anemone'
require 'unix_utils'

Anemone.crawl("http://example.com") do |anemone|
  anemone.on_pages_like(/\b.+.pdf/) do |page|
    puts page.url                          # => http://example.com/foo.bar
    puts UnixUtils.curl(page.url.to_s)     # => /tmp/foo.bar.1239u98sd
  end
end
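Note that UnixUtils.curl saves the download to a temporary file and returns that path, so if you want the PDF under its original name you still have to move it. A sketch, assuming the unix_utils gem is installed:
require 'anemone'
require 'unix_utils'
require 'fileutils'

Anemone.crawl("http://example.com") do |anemone|
  anemone.on_pages_like(/\b.+.pdf/) do |page|
    tmp_path = UnixUtils.curl(page.url.to_s)               # e.g. /tmp/foo.bar.1239u98sd
    FileUtils.mv(tmp_path, File.basename(page.url.path))   # keep the original filename
  end
end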

Watir: open multiple browsers or tabs

How can I open more than one browser using my Watir code, for example via a loop from 0 to 10?
Here is my code:
require 'watir-webdriver'
require 'headless'

class Page
  @headless = Headless.new
  @headless.start
  @browser = Watir::Browser.start 'bit.ly/***'

  def self.get_connection
    puts "Browser started"
    puts @browser.title
    @browser.driver.manage.timeouts.implicit_wait = 3 # 3 seconds
    @browser.select_list(:name => 'ctl00$tresc$111').select_value('6')
    puts "Selected country"
    @browser.select_list(:name => 'ctl00$tresc$222').wait_until_present
    @browser.select_list(:name => 'ctl00$tresc$333').select_value('95')
    puts "Selected city"
  end

  def self.close_connection
    @browser.close
    @headless.destroy
  end
end
Page.get_connection
Page.close_connection
But how to do something like this?
while i < 10
  Page.get_connection
end
This should open ten browsers:
10.times { Watir::Browser.new }
If you want to use the browsers later, you can put them in a hash:
browsers = {}
(0..9).each { |i| browsers[i] = Watir::Browser.new }

browsers[0].goto "google.com"
browsers[1].goto "yahoo.com"
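Because the browsers are kept in that hash, cleaning up afterwards is just another pass over it; for example:
# visit a page in every browser, then close them all
browsers.each_value { |b| b.goto "google.com" }
browsers.each_value(&:close)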
