download a pdf file with webcrawler - ruby

I'm just starting out with the Ruby programming language. I have a Ruby script that uses anemone to crawl the PDF files on a page:
Anemone.crawl("http://example.com") do |anemone|
  anemone.on_pages_like(/\b.+.pdf/) do |page|
    puts page.url
  end
end
I want to download page.url using a Ruby gem. Which gem can I use to download page.url?

No need for an extra gem. Try this:
require 'anemone'
Anemone.crawl("http://www.rubyinside.com/media/", :depth_limit => 1, :obey_robots_txt => true, :skip_query_strings => true) do |anemone|
  # the dot before "pdf" is escaped so that it matches a literal "."
  anemone.on_pages_like(/\b.+\.pdf/) do |page|
    begin
      # the crawler has already fetched the file; page.body holds its bytes
      filename = File.basename(page.url.request_uri.to_s)
      File.open(filename, "wb") { |f| f.write(page.body) }
      puts "downloaded #{page.url}"
    rescue
      puts "error while downloading #{page.url}"
    end
  end
end
gives
downloaded http://www.rubyinside.com/media/poignant-guide.pdf
and the pdf is fine.
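The filename-and-write step of that answer can be pulled out into a small helper. This is only a minimal sketch, not part of the original answer: `save_body` and `target_dir` are hypothetical names, and the body is passed in as a plain string so the helper can be tried without crawling anything.

```ruby
require 'uri'
require 'tmpdir'

# Hypothetical helper: derive a local filename from a URL and write an
# already-downloaded body to disk in binary mode.
def save_body(url, body, target_dir = Dir.pwd)
  # URI#path excludes the query string, so "a.pdf?x=1" is saved as "a.pdf"
  filename = File.basename(URI.parse(url.to_s).path)
  raise ArgumentError, "no filename in #{url}" if filename.empty? || filename == '/'

  path = File.join(target_dir, filename)
  File.open(path, 'wb') { |f| f.write(body) }
  path
end

# Usage with made-up data; inside the crawl block this would be
# save_body(page.url, page.body)
Dir.mktmpdir do |dir|
  puts save_body('http://example.com/media/guide.pdf?download=1', '%PDF-1.4 demo', dir)
end
```

Writing in "wb" mode matters mostly on Windows, where text mode would translate line endings and corrupt the PDF.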

If you're on a UNIX system, maybe UnixUtils:
Anemone.crawl("http://example.com") do |anemone|
  anemone.on_pages_like(/\b.+.pdf/) do |page|
    puts page.url # => http://example.com/foo.bar
    # curl the page URL to a temporary file and print its path
    puts UnixUtils.curl(page.url.to_s) # => /tmp/foo.bar.1239u98sd
  end
end

Related

Ruby - Getting page content even if it doesn't exist

I am trying to put together a series of custom 404 pages.
require 'uri'
def open(url)
  page_content = Net::HTTP.get(URI.parse(url))
  puts page_content.content
end
open('http://somesite.com/1ygjah1761')
The code above exits the program with an error. How can I get the page content from a website, regardless of whether it returns a 404 or not?
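A 404 does not make Net::HTTP.get raise, and Net::HTTPNotFound is a response class rather than an exception, so there is nothing to rescue for the 404 itself. Use get_response, check the response class, and rescue only connection-level errors:
require 'net/http'
require 'uri'
def open(url)
  # get_response returns the response object for any status, including 404
  response = Net::HTTP.get_response(URI.parse(url))
  page_content = response.body
  puts "THIS IS 404" if response.is_a?(Net::HTTPNotFound)
  puts page_content
rescue SocketError, Timeout::Error => e
  # raised when the host cannot be resolved or the connection times out
  puts "request failed: " + e.message
end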
You can find more information on something like this here: http://tammersaleh.com/posts/rescuing-net-http-exceptions/
Net::HTTP.get returns the page content directly as a string, so there is no need to call .content on the results:
page_content = Net::HTTP.get(URI.parse(url))
puts page_content

Where can I use HAML in Octopress?

According to the official Octopress page, it has a HAML integration plugin. Naturally, I gave it a try. I backed up my source/_includes/custom/head.html file, converted it to HAML and saved it as source/_includes/custom/head.haml. It gave me an error.
I tried doing the same with source/_layouts/page.html file, and it worked like a charm.
My question is, where can I and where can I not use HAML in an Octopress blog?
As you can see from the source code, the HAML plugin only processes page content.
See the convert and output_ext methods.
https://github.com/imathis/octopress/blob/master/plugins/haml.rb
module Jekyll
  require 'haml'
  class HamlConverter < Converter
    safe true
    priority :low
    def matches(ext)
      ext =~ /haml/i
    end
    def output_ext(ext)
      ".html"
    end
    def convert(content)
      begin
        engine = Haml::Engine.new(content)
        engine.render
      rescue StandardError => e
        puts "!!! HAML Error: " + e.message
      end
    end
  end
end

anemone print links on first page

I wanted to see what I was doing wrong here.
I need to print the links on the parent page, even if they point to another domain, and then stop.
require 'anemone'
url = ARGV[0]
Anemone.crawl(url, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    page.links.each do |link|
      puts link
    end
  end
end
What am I not doing right?
Edit: It outputs nothing.
This worked for me:
require 'anemone'
require 'optparse'
file = ARGV[0]
# the input file lists one URL per line
File.open(file).each do |url|
  url = URI.parse(URI.encode(url.strip))
  Anemone.crawl(url, :discard_page_bodies => true) do |anemone|
    anemone.on_every_page do |page|
      # select the href attribute of every anchor, whatever domain it points to
      links = page.doc.xpath("//a/@href")
      if (links != nil)
        links.each do |link|
          puts link.to_s
        end
      end
    end
  end
end
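Reading the raw href values, as the XPath query does, yields relative and absolute links mixed together. As far as I recall, Anemone's page.links keeps only links on the crawled domain, but treat that as an assumption rather than a documented guarantee. The sketch below shows one way to resolve raw hrefs against the page URL and separate same-host links from external ones using only the standard library; `split_links` is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical helper: turn raw href strings into absolute URLs and split
# them into links on the same host as the page and links to other hosts.
def split_links(base_url, hrefs)
  base = URI.parse(base_url)
  absolute = hrefs.map do |href|
    begin
      URI.join(base_url, href.strip)
    rescue URI::Error
      nil # skip hrefs that are not valid URIs
    end
  end.compact.uniq
  # partition returns [same-host links, everything else]
  absolute.partition { |uri| uri.host == base.host }
end

internal, external = split_links(
  'http://example.com/index.html',
  ['/about', 'contact.html', 'http://other.example.org/page']
)
puts internal
puts external
```

Inside the crawl block, the hrefs array could come from `page.doc.xpath("//a/@href").map(&:to_s)` and the base URL from `page.url.to_s`.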

Ruby scraper. How to export to CSV?

I wrote this ruby script to scrape product info from the manufacturer website. The scraping and storage of the product objects in an array works, but I can't figure out how to export the array data to a csv file. This error is being thrown:
scraper.rb:45: undefined method `send_data' for main:Object (NoMethodError)
I do not understand this piece of code. What's this doing and why isn't it working right?
send_data csv_data,
:type => 'text/csv; charset=iso-8859-1; header=present',
:disposition => "attachment; filename=products.csv"
Full code:
#!/usr/bin/ruby
require 'rubygems'
require 'anemone'
require 'fastercsv'
productsArray = Array.new
class Product
  attr_accessor :name, :sku, :desc
end
# Scraper Code
Anemone.crawl("http://retail.pelicanbayltd.com/") do |anemone|
  anemone.on_every_page do |page|
    currentPage = Product.new
    #Product info parsing
    currentPage.name = page.doc.css(".page_headers").text
    currentPage.sku = page.doc.css("tr:nth-child(2) strong").text
    currentPage.desc = page.doc.css("tr:nth-child(4) .item").text
    if currentPage.sku =~ /#\d\d\d\d/
      currentPage.sku = currentPage.sku[1..-1]
      productsArray.push(currentPage)
    end
  end
end
# CSV Export Code
products = productsArray.find(:all)
csv_data = FasterCSV.generate do |csv|
  # header row
  csv << ["sku", "name", "desc"]
  # data rows
  productsArray.each do |product|
    csv << [product.sku, product.name, product.desc]
  end
end
send_data csv_data,
  :type => 'text/csv; charset=iso-8859-1; header=present',
  :disposition => "attachment; filename=products.csv"
If you are new to Ruby, you should be using Ruby 1.9 or later, in which case you can use the built-in CSV library, which is FasterCSV folded into the standard library with added i18n support:
require 'csv'
CSV.open('filename.csv', 'w') do |csv|
  csv << [sku, name, desc]
end
http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html
File.open('filename.csv', 'w') do |f|
  f.write(csv_data)
end
It probably makes more sense to open the file first:
@csv = FasterCSV.open('filename.csv', 'w')
and then write to it as you go along:
@csv << [sku, name, desc]
That way, if your script crashes halfway through, you've at least got half of the data.
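For context, send_data is a Rails controller method, which is why it is undefined in a standalone script; writing the file directly, as the answers suggest, sidesteps it. The following is a minimal sketch of that approach with the standard csv library. The `Product` struct stands in for the class in the question, and `export_products` and the sample row are made up for illustration.

```ruby
require 'csv'
require 'tmpdir'

# Stand-in for the Product class from the question
Product = Struct.new(:sku, :name, :desc)

# Hypothetical helper: write a header row plus one row per product.
def export_products(products, path)
  CSV.open(path, 'w') do |csv|
    csv << %w[sku name desc]
    products.each { |product| csv << [product.sku, product.name, product.desc] }
  end
  path
end

# Made-up sample data; the scraper would pass productsArray instead
products = [Product.new('1234', 'Widget', 'A small, useful widget')]
path = export_products(products, File.join(Dir.tmpdir, 'products.csv'))
puts File.read(path)
```

CSV takes care of quoting, so a description that contains a comma still ends up in a single column.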

Ruby ODT file open Zip/Zip

I am trying to access the insides of an ODT file. When I run the code through IRB it works perfectly fine, but when I put it in a script, it fails with this error:
./replace_odf.rb:3:in `require': no such file to load -- rubygems (LoadError)
from ./replace_odf.rb:3
Here is my code when run through IRB. As you can see towards the end, it can access the file.
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'zip/zip'
=> true
irb(main):003:0> odt = Zip::ZipFile.open('java.odt')
=> java.odt
irb(main):004:0> odt.entries.each do |entry|
irb(main):005:1* puts entry.name
irb(main):006:1> end
mimetype
Configurations2/statusbar/
Configurations2/accelerator/current.xml
Configurations2/floater/
... etc
Here is my script code. When run, it gives the error posted above.
require 'rubygems'
require 'zip/zip'
require 'rexml/document'
odt = Zip::ZipFile.open('java.odt')
file1 = odt.entries[0]
odt.entries.each do |entry|
  puts entry.name if entry.name =~ /\.xml$/
end
puts odt.read("mimetype")
xml = odt.read("content.xml")
doc = REXML::Document.new(xml)
doc.root.each_element do |o|
  o.each_element do |i|
    puts i
  end
end
