Ruby crawl site, add URL parameter - ruby

I am trying to crawl a site and append a URL parameter to each address before hitting them. Here's what I have so far:
require "spidr"
Spidr.site('http://www.example.com/') do |spider|
  spider.every_url { |url| puts url }
end
But I'd like the spider to hit all pages and append a param like so:
example.com/page1?var=param1
example.com/page2?var=param1
example.com/page3?var=param1
UPDATE 1 -
Tried this, not working though, errors out ("405 method not allowed") after a few iterations:
require "spidr"
require "open-uri"
Spidr.site('http://example.com') do |spider|
  spider.every_url do |url|
    link = url + "?foo=bar"
    response = open(link).read
  end
end

Instead of relying on Spidr, I grabbed a CSV of the URLs I needed from Google Analytics and ran through those. Got the job done.
require 'csv'
require 'open-uri'
CSV.foreach(File.path("the-links.csv")) do |row|
  link = "http://www.example.com" + row[0] + "?foo=bar"
  encoded_url = URI.encode(link)
  response = open(encoded_url).read
  puts encoded_url
  puts
end
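Plain string concatenation with "?foo=bar" produces a broken address when the URL already carries a query string, and URI.encode is no longer available in Ruby 3.0 and later. A minimal sketch of an alternative, using only the standard uri library, is shown here. The helper name add_param is an assumption made for illustration; it is not part of Spidr or open-uri.

```ruby
require "uri"

# Hypothetical helper: return +url+ with key=value appended to its query string.
# Existing query parameters are kept, and the new pair is form-encoded.
def add_param(url, key, value)
  uri = URI.parse(url)
  pairs = URI.decode_www_form(uri.query || "")
  pairs << [key, value]
  uri.query = URI.encode_www_form(pairs)
  uri.to_s
end

puts add_param("http://www.example.com/page1", "var", "param1")
puts add_param("http://www.example.com/page2?x=1", "var", "param1")
```

The resulting string can then be fetched with URI.open from open-uri, or handed to whatever HTTP client the script already uses.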

Related

want to get taobao's list of URL of products on search result page without taobao API

I want to get taobao's list of URL of products on search result page without taobao API.
I tried the following Ruby script.
require "open-uri"
require "rubygems"
require "nokogiri"
url='https://world.taobao.com/search/search.htm?_ksTS=1517338530524_300&spm=a21bp.7806943.20151106.1&search_type=0&_input_charset=utf-8&navigator=all&json=on&q=%E6%99%BA%E8%83%BD%E6%89%8B%E8%A1%A8&cna=htqfEgp0pnwCATyQWEDB%2FRCE&callback=__jsonp_cb&abtest=_AB-LR517-LR854-LR895-PR517-PR854-PR895'
charset = nil
html = open(url) do |f|
  charset = f.charset
  f.read
end
doc = Nokogiri::HTML.parse(html, nil, charset)
p doc.xpath('//*[@id="list-itemList"]/div/div/ul/li[1]/div/div[1]/div/a/@href').each{|i| puts i.text}
# => 0
I want to get list of URL like https://click.simba.taobao.com/cc_im?p=%D6%C7%C4%DC%CA%D6%B1%ED&s=328917633&k=525&e=lDs3%2BStGrhmNjUyxd8vQgTvfT37ERKUkJtUYVk0Fu%2FVZc0vyfhbmm9J7EYm6FR5sh%2BLS%2FyzVVWDh7%2FfsE6tfNMMXhI%2B0UDC%2FWUl0TVvvELm1aVClOoSyIIt8ABsLj0Cfp5je%2FwbwaEz8tmCoZFXvwyPz%2F%2ByQnqo1aHsxssXTFVCsSHkx4WMF4kAJ56h9nOp2im5c3WXYS4sLWfJKNVUNrw%2BpEPOoEyjgc%2Fum8LOuDJdaryOqOtghPVQXDFcIJ70E1c5A%2F3bFCO7mlhhsIlyS%2F6JgcI%2BCdFFR%2BwwAwPq4J5149i5fG90xFC36H%2B6u9EBPvn2ws%2F3%2BHHXRqztKxB9a0FyA0nyd%2BlQX%2FeDu0eNS7syyliXsttpfoRv3qrkLwaIIuERgjVDODL9nFyPftrSrn0UKrE5HoJxUtEjsZNeQxqovgnMsw6Jeaosp7zbesM2QBfpp6NMvKM5e5s1buUV%2F1AkICwRxH7wrUN4%2BFn%2FJ0%2FIDJa4fQd4KNO7J5gQRFseQ9Z1SEPDHzgw%3D however I am getting 0
What should I do?
I don't know taobao.com, but the page seems to run a lot of JavaScript, so the content may not be retrievable by a client without JavaScript capabilities. Instead of open-uri, you could try the selenium-webdriver gem:
https://rubygems.org/gems/selenium-webdriver/versions/2.53.4
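One more thing that may be worth checking before reaching for a browser: the URL in the question contains json=on and callback=__jsonp_cb, which suggests the response could be JSONP rather than HTML. That is an assumption and has not been verified against the site. If it holds, an HTML XPath query would find nothing, and the body would need to be unwrapped and parsed as JSON instead. The sketch below uses a made-up sample payload; the field names items and href are hypothetical, not taobao's real schema.

```ruby
require "json"

# Hypothetical helper: strip a JSONP wrapper such as `__jsonp_cb({...});`
# and parse the JSON inside it.
def parse_jsonp(body)
  json = body.strip.sub(/\A[\w$.]+\s*\(/, "").sub(/\)\s*;?\s*\z/, "")
  JSON.parse(json)
end

# Made-up payload for illustration only; the real response layout is unknown.
sample = '__jsonp_cb({"items":[{"href":"https://example.com/a"},{"href":"https://example.com/b"}]});'
data = parse_jsonp(sample)
puts data["items"].map { |item| item["href"] }
```

Printing the raw body first would show whether it is HTML, JSON, or JSONP, and therefore which parser applies.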

Nokogiri Throwing Exception in Function but not outside of Function

I'm new to Ruby and am using Nokogiri to parse html webpages. An error is thrown in a function when it gets to the line:
currentPage = Nokogiri::HTML(open(url))
I have verified the inputs of the function; url is a string containing a web address. The line mentioned above works exactly as intended outside the function, but not inside it. When execution reaches that line inside the function, the following error is thrown:
WebCrawler.rb:25:in `explore': undefined method `+@' for #<Nokogiri::HTML::Document:0x007f97ea0cdf30> (NoMethodError)
from WebCrawler.rb:43:in `<main>'
The function the problematic line is in is pasted below.
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do|link|
    puts link
    explore(link)
  end
end
Here is the full program (It's not much longer):
require 'nokogiri'
require 'open-uri'
#Crawler Params
START_URL = "https://en.wikipedia.org"
CRAWLED_PAGES_COUNTER = 0
CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
  if CRAWLED_PAGES_COUNTER > CRAWLED_PAGES_LIMIT
    return
  end
  CRAWLED_PAGES_COUNTER++
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do|link|
    puts link
    explore(link)
  end
end
def eval_page(page)
  puts page.title
end
#Start Crawling
explore(START_URL)
Ruby has no ++ operator. CRAWLED_PAGES_COUNTER++ followed by the next line is parsed as a binary + whose right-hand side is a unary + applied to the result of that next line, which is why the error complains about +@ on the Nokogiri document. Use += 1 instead. A constant also cannot be reassigned inside a method, so the counter has to be a variable; the version below uses globals:
require 'nokogiri'
require 'open-uri'
#Crawler Params
$START_URL = "https://en.wikipedia.org"
$CRAWLED_PAGES_COUNTER = 0
$CRAWLED_PAGES_LIMIT = 5
#Crawler Functions
def explore(url)
  if $CRAWLED_PAGES_COUNTER > $CRAWLED_PAGES_LIMIT
    return
  end
  $CRAWLED_PAGES_COUNTER+=1
  currentPage = Nokogiri::HTML(open(url))
  links = currentPage.xpath('//@href').map(&:value)
  eval_page(currentPage)
  links.each do|link|
    puts link
    explore(link)
  end
end
def eval_page(page)
  puts page.title
end
#Start Crawling
explore($START_URL)
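The page limit can also be kept without globals by holding the counter in a local variable and walking a queue instead of recursing. The following is only a sketch: PAGES is a hypothetical in-memory stand-in for the real site, so the example runs without network access, and crawl is an illustrative name rather than anything from the question.

```ruby
require "set"

# Hypothetical in-memory "site": each page maps to the links found on it.
PAGES = {
  "/a" => ["/b", "/c"],
  "/b" => ["/a", "/d"],
  "/c" => ["/d"],
  "/d" => []
}.freeze

# Visit pages breadth-first and stop once +limit+ pages have been visited.
def crawl(start, limit)
  visited = Set.new
  queue = [start]
  count = 0
  while count < limit && !queue.empty?
    page = queue.shift
    next if visited.include?(page) # skip pages already seen
    visited << page
    count += 1                     # Ruby has no ++ operator
    queue.concat(PAGES.fetch(page, []))
  end
  visited.to_a
end

p crawl("/a", 3)
```

In a real crawler the PAGES lookup would be replaced by fetching the page and extracting its links, with relative links resolved against the current URL before they are queued.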
Just to give you something to build from, this is a simple spider that only harvests and visits links. Modifying it to do other things would be easy.
require 'nokogiri'
require 'open-uri'
require 'set'
BASE_URL = 'http://example.com'
URL_FORMAT = '%s://%s:%s'
SLEEP_TIME = 30 # in seconds
urls = [BASE_URL]
last_host = BASE_URL
visited_urls = Set.new
visited_hosts = Set.new
until urls.empty?
  this_uri = URI.join(last_host, urls.shift)
  next if visited_urls.include?(this_uri)
  puts "Scanning: #{this_uri}"
  doc = Nokogiri::HTML(this_uri.open)
  visited_urls << this_uri
  if visited_hosts.include?(this_uri.host)
    puts "Sleeping #{SLEEP_TIME} seconds to reduce server load..."
    sleep SLEEP_TIME
  end
  visited_hosts << this_uri.host
  urls += doc.search('[href]').map { |node|
    node['href']
  }.select { |url|
    extension = File.extname(URI.parse(url).path)
    extension[/\.html?$/] || extension.empty?
  }
  last_host = URL_FORMAT % [:scheme, :host, :port].map{ |s| this_uri.send(s) }
  puts "#{urls.size} URLs remain."
end
It:
Works on http://example.com. That site is designed and designated for experimenting.
Checks to see if a page was visited previously and won't scan it again. It's a naive check and will be fooled by URLs containing queries or queries that are not in a consistent order.
Checks to see if a site was previously visited and automatically throttles the page retrieval if so. It could be fooled by aliases.
Checks to see if a page ends with ".htm", ".html" or has no extension. Anything else is ignored.
Writing an industrial-strength spider is much more involved. Robots.txt files need to be honored, figuring out how to deal with pages that redirect to other pages either via HTTP timeouts or JavaScript redirects is a fun task, and dealing with malformed pages is a challenge....
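The visited-page check above is described as naive because the same page can appear under URLs whose query parameters come in a different order. One possible way to reduce that, sketched here with the standard uri library, is to compare normalized keys instead of raw URLs. normalize_url is a hypothetical helper, and the rules it applies (lowercase host, drop fragment, sort query pairs) are one reasonable choice, not a complete canonicalization.

```ruby
require "uri"
require "set"

# Hypothetical helper: build a comparison key for a URL.
# The host is lowercased, the fragment is dropped, and query
# parameters are sorted so that their order does not matter.
def normalize_url(url)
  uri = URI.parse(url)
  uri.fragment = nil
  uri.host = uri.host.downcase if uri.host
  if uri.query
    pairs = URI.decode_www_form(uri.query).sort
    uri.query = pairs.empty? ? nil : URI.encode_www_form(pairs)
  end
  uri.to_s
end

visited = Set.new
%w[
  http://example.com/list?b=2&a=1
  http://EXAMPLE.com/list?a=1&b=2#top
].each { |u| visited << normalize_url(u) }

puts visited.size
```

Both sample URLs collapse to the same key, so the second one would be skipped by a visited check built on this helper.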

How to use three url to make one url array. Use the same url for nokogiri

I might be crazy, but I have been trying to gather all my favorite news sites and scrape them in one Ruby file. I would like to scrape headlines from these sites and hopefully create a custom page for my site. So far I have been able to scrape the headlines from all three sites individually. I would like to put all three URLs into an array and use Nokogiri just once. Can anyone help me?
require 'nokogiri'
require 'open-uri'
url = 'http://www.engadget.com'
data = Nokogiri::HTML(open(url))
@feeds = data.css('.post')
@feeds.each do |feed|
  puts feed.css('.headline').text.strip
end
url2 = 'http://www.modmyi.com'
data2 = Nokogiri::HTML(open(url2))
@modmyi = data2.css('.title')
@modmyi.each do |mmi|
  puts mmi.css('span').text
end
url3 = 'http://www.cnn.com/specials/last-50-stories'
data3 = Nokogiri::HTML(open(url3))
@cnn = data3.css('.cd__content')
@cnn.each do |cn|
  puts cn.css('.cd__headline').text
end
You might want to extract the loading of the document and the extraction of the titles into its own class:
require 'nokogiri'
require 'open-uri'
class TitleLoader < Struct.new(:url, :outer_css, :inner_css)
  # Returns the stripped headline text of every post on the page.
  def titles
    load_posts.map { |post| extract_title(post) }
  end

  private

  def read_document
    Nokogiri::HTML(open(url))
  end

  def load_posts
    read_document.css(outer_css)
  end

  def extract_title(post)
    post.css(inner_css).text.strip
  end
end
And then use that class like this:
urls = [
  ['http://www.engadget.com', '.post', '.headline'],
  ['http://www.modmyi.com', '.title', 'span'],
  ['http://www.cnn.com/specials/last-50-stories', '.cd__content', '.cd__headline']
]
urls.map { |args| TitleLoader.new(*args).titles }.flatten

Web Scraping with Nokogiri::HTML and Ruby - save images

I'm working on a script to grab data and images from webshop product pages (with approval from the owner).
I have a working script that loops through a CSV file with 20042 product URLs and stores the data I need in a CSV file. The final thing I need is to save the product images.
I have this code (thanks to Phrogz in this thread):
URL = 'http://www.sample.com/page.html'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
def make_absolute( href, root )
  URI.parse(root).merge(URI.parse(href)).to_s
end
Nokogiri::HTML(open(URL)).xpath('//*[@id="zoom"]/@href').each do |src|
  uri = make_absolute(src,URL)
  File.open(File.basename(uri),'wb'){ |f| f.write(open(uri).read) }
end
That runs great for a single URL, but I'm struggling to get it to loop through the URLs from the CSV file in my main script, which starts like this:
# encoding: utf-8
require 'nokogiri'
require 'open-uri'
require 'csv'
require 'mechanize'
@prices = Array.new
@title = Array.new
@description = Array.new
@warranty = Array.new
@leadtime = Array.new
@urls = Array.new
@categories = Array.new
@subcategories = Array.new
@subsubcategories = Array.new
urls = CSV.read("lotofurls.csv")
(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
Looks like all I need to figure out is how to feed the urls to the code saving the image but any help would be much appreciated!
You can make quick work of this with something like RMagick (or ImageMagick, MiniMagick, etc.).
For RMagick, you could do something like this:
require 'rmagick'
images.each do |image|
  url = image.url # should be a string
  Magick::Image.read(url).first.resize_to_fill(200,200).write(image.desired_filename)
end
That would write a 200x200px image for each url you provide (resize_to_fill is optional, obviously). The library is very powerful, with many, many options. If you go this route, I'd recommend the railscast on image manipulation: http://railscasts.com/episodes/374-image-manipulation
And the documentation if you want to get more advanced: http://rmagick.rubyforge.org/
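If resizing is not needed, feeding the CSV URLs into the image-saving code can stay within the standard library. The sketch below wraps the two steps in small helpers that could be called from the existing loop. It rests on assumptions: image_target and save_image are hypothetical helper names, the ../img/zoom.jpg path is made up, and URI.open is used because Ruby 3.0 and later no longer open URLs through plain open.

```ruby
require "uri"
require "open-uri"

# Hypothetical helper: resolve an image href found on +page_url+
# and pick a local file name from the last path segment.
def image_target(page_url, href)
  absolute = URI.join(page_url, href)
  [absolute.to_s, File.basename(absolute.path)]
end

# Hypothetical helper: download one image into +dir+.
# Intended to be called inside the loop over the CSV rows.
def save_image(page_url, href, dir = ".")
  url, name = image_target(page_url, href)
  File.binwrite(File.join(dir, name), URI.open(url).read)
end

# Made-up example values; no download happens here.
p image_target("http://www.sample.com/shop/page.html", "../img/zoom.jpg")
```

Inside the main loop this might be called as save_image(urls[index][0], node.value) for each attribute node returned by the XPath query for the zoom link, so every product page in the CSV is resolved against its own URL.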

Parsing multiple URLs with Nokogiri?

I am trying to get a list of hrefs from a list of ten URLs and running into trouble.
Each of these blocks works on its own, but when I try to combine them, I get a list of pages 1-10 and an error. What is the proper way to go about this?
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
#/ this prints all 10 of the URLs to pull page hrefs from.
1.upto(10) do |pagenum|
  url = "http://www.mywebsite.com/page/#{pagenum}"
  puts url
end
#/ Prints out all of the hrefs.
doc = Nokogiri::HTML(open(url))
doc.xpath('//h2/a/@href').each do |node|
  puts node.text
end
Here's your code, annotated:
1.upto(10) do |pagenum|
  # Create a local variable named `url`
  url = "http://www.mywebsite.com/page/#{pagenum}"
  # Print it
  puts url
end
# Open...uhm...which URL?
doc = Nokogiri::HTML(open(url))
The problem is that the url variable is "scoped" locally to the upto block. It no longer exists once you exit that block. Perhaps you wanted this:
1.upto(10) do |pagenum|
  # Create a local variable named `url`
  url = "http://www.mywebsite.com/page/#{pagenum}"
  # Print it
  puts url
  # Fetch this page and print its hrefs
  doc = Nokogiri::HTML(open(url))
  doc.xpath('//h2/a/@href').each do |node|
    puts node.text
  end
end
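Another way to avoid the scoping problem, if the list of URLs is needed after the loop, is to build it as a value first and iterate over it afterwards. This is a small sketch of that idea; page_urls and inner are names chosen here for illustration.

```ruby
# Build the list first, so the URLs outlive the block that creates them.
page_urls = (1..10).map { |pagenum| "http://www.mywebsite.com/page/#{pagenum}" }

# A variable first assigned inside a block is not visible after the block ends.
1.upto(2) { |n| inner = n }
puts defined?(inner).inspect

# The collected URLs are still available here and can be fetched one by one.
page_urls.each { |url| puts url }
```

Each url in the final loop could then be passed to Nokogiri exactly as in the corrected code above.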
