I am trying to get a list of hrefs from a list of ten URLs and running into trouble.
Each of these blocks work separately from each other, but, when I try to combine them, I get a list of pages 1-10 and an error. What is the proper way to go about this?
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
#/ this prints all 10 of the URLs to pull page hrefs from.
1.upto(10) do |pagenum|
url = "http://www.mywebsite.com/page/#{pagenum}"
puts url
end
#/ Prints out all of the hrefs.
doc = Nokogiri::HTML(open(url))
doc.xpath('//h2/a/#href').each do |node|
puts node.text
end
Here's your code, annotated:
1.upto(10) do |pagenum|
# Create a local variable named `url`
url = "http://www.mywebsite.com/page/#{pagenum}"
# Print it
puts url
end
# Open...uhm...which URL?
doc = Nokogiri::HTML(open(url))
The problem is that the url variable is "scoped" locally to the upto block. It no longer exists once you exist that block. Perhaps you wanted this:
1.upto(10) do |pagenum|
# Create a local variable named `url`
url = "http://www.mywebsite.com/page/#{pagenum}"
# Print it
puts url
# Print this URL
doc = Nokogiri::HTML(open(url))
doc.xpath('//h2/a/#href').each do |node|
puts node.text
end
end
Related
I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
Tried this, not working though, errors out ("405 method not allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
spider.every_url do |url|
link= url+"?foo=bar"
response = open(link).read
end
end
Instead of relying on Spidr, I just grabbed a CSV of the URLs I needed from Google Analytics, then ran thru those. Got the job done.
require 'csv'
require 'open-uri'
CSV.foreach(File.path("the-links.csv")) do |row|
link = "http://www.example.com"+row[0]+"?foo=bar"
encoded_url = URI.encode(link)
response = open(encoded_url).read
puts encoded_url
puts
end
I'm new to Ruby and am using Nokogiri to parse html webpages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function, url is a string with a webaddress. The line I previously mention works exactly as intended when used outside of the function, but not inside. When it gets to that line inside the function the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+#' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//#href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'
#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
return
end
CRAWLED_PAGES_COUNTER++
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//#href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore(START_URL)
require 'nokogiri'
require 'open-uri'
#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
return
end
$CRAWLED_PAGES_COUNTER+=1
currentPage = Nokogiri::HTML(open(url))
links = currentPage.xpath('//#href').map(&:value)
eval_page(currentPage)
links.each do|link|
puts link
explore(link)
end
end
def eval_page(page)
puts page.title
end
#Start Crawling
explore($START_URL)
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'
BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds
urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new
until urls.empty?
this_uri = URI.join(last_host, urls.shift)
next if visited_urls.include?(this_uri)
puts "Scanning: #{this_uri}"
doc = Nokogiri::HTML(this_uri.open)
visited_urls << this_uri
if visited_hosts.include?(this_uri.host)
puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
sleep SLEEP_TIME
end
visited_hosts << this_uri.host
urls += doc.search('[href]').map { |node|
node['href']
}.select { |url|
extension = File.extname(URI.parse(url).path)
extension[/\.html?$/] || extension.empty?
}
last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
The actual code to write an industrial strength spider is much more involved. Robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages either via HTTP timeouts or JavaScript redirects is a fun task, dealing with malformed pages are a challenge....
I have a working program that searches Google using Mechanize, however when the program searches Google it also pulls sites that look something like http://webcache.googleusercontent.com/.
I would like to reject that site from being stored in the file. All the sites' URLs are structured differently.
Source code:
require 'mechanize'
PATH = Dir.pwd
SEARCH = "test"
def info(input)
puts "[INFO]#{input}"
end
def get_urls
info("Searching for sites.")
agent = Mechanize.new
page = agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = "#{SEARCH}"
url = agent.submit(google_form, google_form.buttons.first)
url.links.each do |link|
if link.href.to_s =~ /url.q/
str = link.href.to_s
str_list = str.split(%r{=|&})
urls_to_log = str_list[1]
success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
end
end
info("Sites dumped into #{PATH}/temp/sites.txt")
end
get_urls
Text file:
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk
It works now. I had issue with success('log'), i dont know why but commented it.
str_list = str.split(%r{=|&})
next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
# success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
There are well-tested wheels used to tear apart URLs into the component parts so use them. Ruby comes with URI, which allows us to easily extract the host, path or query:
require 'uri'
URL = 'http://foo.com/a/b/c?d=1'
URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
Ruby's Enumerable module includes reject and select which make it easy to loop over an array or enumerable object and reject or select elements from it:
(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]
Using all that you could check the host of a URL for sub-strings and reject any you don't want:
require 'uri'
%w[
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject{ |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]
Using these methods and techniques you can reject or select from an input file, or just peek into single URLs and choose to ignore or honor them.
I might be crazy, but I have been trying to gather all my favorite news sites and scrap them into one ruby file. I would like to use these sites to scrape headlines and hopefully create a custom page for my site. Now so far i have been able to scrape the headlines from all three site individually. I am looking to use all three url into an array and use Nokogiri just once. Can anyone help me ?
require 'nokogiri'
require 'open-uri'
url = 'http://www.engadget.com'
data = Nokogiri::HTML(open(url))
#feeds = data.css('.post')
#feeds.each do |feed|
puts feed.css('.headline').text.strip
end
url2 = 'http://www.modmyi.com'
data2 = Nokogiri::HTML(open(url2))
#modmyi = data2.css('.title')
#modmyi.each do |mmi|
puts mmi.css('span').text
end
url3 = 'http://www.cnn.com/specials/last-50-stories'
data3 = Nokogiri::HTML(open(url3))
#cnn = data3.css('.cd__content')
#cnn.each do |cn|
puts cn.css('.cd__headline').text
end
You might want to extract the loading of the document and the extraction of the titles into its own class:
require 'nokogiri'
require 'open-uri'
class TitleLoader < Struct.new(:url, :outher_css, :inner_css)
def titles
load_posts.map { |post| extract_title(post) }
end
private
def read_document
Nokogiri::HTML(open(url))
end
def load_posts
read_document.css(outher_css)
end
def extract_title(post)
post.css(inner_css).text.strip
end
end
And than use that class like this:
urls = [
['http://www.engadget.com', '.post', '.headline'],
['http://www.modmyi.com', '.title', 'span'],
['http://www.cnn.com/specials/last-50-stories', '.cd__content', '.cd__headline']
]
urls.map { |args| TitleLoader.new(*args).titles }.flatten
I'm working on a script to grab data & images from webshop productpages
(with approval from the owner)
I have a working script that loops through a CSV file with 20042 product URLS to get me the data I need that is stored in a CSV file. Final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread)
URL = 'http://www.sample.com/page.html'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
def make_absolute( href, root )
URI.parse(root).merge(URI.parse(href)).to_s
end
Nokogiri::HTML(open(URL)).xpath('//*[#id="zoom"]/#href').each do |src|
uri = make_absolute(src,URL)
File.open(File.basename(uri),'wb'){ |f| f.write(open(uri).read) }
end
that runs great for a seperate URL but I'm struggling to get it working and loop through the URLS from the CSV file in my main script that starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'
#prices = Array.new
#title = Array.new
#description = Array.new
#warranty = Array.new
#leadtime = Array.new
#urls = Array.new
#categories = Array.new
#subcategories = Array.new
#subsubcategories = Array.new
urls = CSV.read("lotofurls.csv")
(0..urls.length - 1).each do |index|
puts urls[index][0]
doc = Nokogiri::HTML(open(urls[index][0]))
Looks like all I need to figure out is how to feed the urls to the code saving the image but any help would be much appreciated!
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc)
For RMagick, you could do something like this
require 'rmagick'
images.each do |image|
url = image.url # should be a string
Magick::Image.read(url).first.resize_to_fill(200,200).write(image.desired_filename)
end
That would write a 200x200px image for each url you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the railscast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/