Screen Scraping in Rails 3 - ruby

What are the screen scraping options in Rails 3 - gem/library? I have used Nokogiri in the past but just wanted to know if there are better options in Rails 3.

If this is a one-off task or your target data set is relatively small (under a hundred pages), use Mechanize (browse and scrape) or Anemone (does whatever Mechanize does, plus some crawling-specific options).
If you need to automate this collection or if you are dealing with large data sets, consider using a web service. Bobik is a good choice in this bucket.

Rails doesn't do screen scraping. You are free to use Ruby code that adds that functionality, but Rails by itself only generates pages.
Mechanize, which uses Nokogiri internally, is a good choice; otherwise I always roll my own using Nokogiri and OpenURI.

On the fantastic RubyTools website you can find several Ruby libraries for parsing HTML. Still, Nokogiri is the most popular.

You can also use the Scrapifier gem to get metadata from URIs found in a string. It's very simple to use:
'Wow! What an awesome site: http://adtangerine.com!'.scrapify
#=> {
# title: "AdTangerine | Advertising Platform for Social Media",
# description: "AdTangerine is an advertising platform that uses the tangerine as a virtual currency for advertisers and publishers in order to share content on social networks.",
# images: ["http://adtangerine.com/assets/logo_adt_og.png", "http://adtangerine.com/assets/logo_adt_og.png", "http://s3-us-west-2.amazonaws.com/adtangerine-prod/users/avatars/000/000/834/thumb/275747_1118382211_1929809351_n.jpg", "http://adtangerine.com/assets/foobar.gif"],
# uri: "http://adtangerine.com"
# }

Related

Fastest Net::HTTP/Net::HTTPS wrapper for Ruby

What is the fastest way to download a webpage for parsing in Ruby? I've tried open-uri and HTTParty; both seem to take roughly 25 seconds to download simple webpages (I've tried multiple sites).
I'm passing the pages to Nokogiri, but the latency occurs before anything is passed to Nokogiri.
I prefer the 'http' gem (https://github.com/httprb/http). It is fast and has a clean API.
You can also take a look at the HTTP clients comparison table:
https://github.com/httprb/http#another-ruby-http-library-why-should-i-care
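To confirm that the delay really sits in the download step and not in parsing, the fetch can be timed on its own. The following is only a sketch using the standard library: `timed` and `fetch` are hypothetical helper names, and the timeout values are arbitrary assumptions, not recommendations from the thread.

```ruby
require 'benchmark'
require 'net/http'
require 'uri'

# Runs the block and returns [result, elapsed_seconds].
def timed
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  [result, elapsed]
end

# Hypothetical helper: fetches a page with explicit timeouts so a slow
# connection or a slow server fails fast instead of hanging.
def fetch(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: open_timeout,
                  read_timeout: read_timeout) do |http|
    http.get(uri.request_uri).body
  end
end

# Usage (needs network access, so it is left commented out):
# body, seconds = timed { fetch('http://www.example.com/') }
# puts format('download took %.2fs', seconds)
```

If the measured time is still in the tens of seconds with every client, the cause is more likely the network or DNS setup than the choice of HTTP library.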

How do I view my gem's (or any other gem's) gem.description and gem.summary texts (.gemspec items) from the command line?

The .gemspec file I so carefully documented when I created my gem, how do I access its contents? Specifically I'd like to access the gem.description and gem.summary entries because I put some very useful info in there.
I hope there is a better answer than this, since reading YAML can be annoying, but you can use gem specification GEMNAME. This will spit out a lot of information, so you might want to pipe it to grep.
You can get slightly more readable output by piping gem specification through something that pulls out what you want.
This can be much more legible, especially when the description is a multiline string:
% gem specification rack description | ruby -ryaml -e 'puts YAML.load(STDIN.read)'
Rack provides a minimal, modular and adaptable interface for developing
web applications in Ruby. By wrapping HTTP requests and responses in
the simplest way possible, it unifies and distills the API for web
servers, web frameworks, and software in between (the so-called
middleware) into a single method call.
Also see http://rack.github.com/.
% gem specification hoe description | ruby -ryaml -e 'puts YAML.load(STDIN.read)'
Hoe is a rake/rubygems helper for project Rakefiles. It helps you
manage, maintain, and release your project and includes a dynamic
plug-in system allowing for easy extensibility. Hoe ships with
plug-ins for all your usual project tasks including rdoc generation,
testing, packaging, deployment, and announcement..
See class rdoc for help. Hint: `ri Hoe` or any of the plugins listed
below.
For extra goodness, see: http://seattlerb.rubyforge.org/hoe/Hoe.pdf
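The same two fields can also be read from inside Ruby through the RubyGems API instead of parsing YAML. This is a minimal sketch: `gem_texts` is a hypothetical helper name, and it assumes the gem in question is installed for the Ruby that runs the script.

```ruby
require 'rubygems'

# Hypothetical helper: returns the summary and description of an installed
# gem, or nil when no gem with that name is installed.
def gem_texts(name)
  spec = Gem::Specification.find_by_name(name)
  { summary: spec.summary, description: spec.description }
rescue Gem::LoadError
  nil
end

# Prints the texts for rack if it is installed; prints nothing otherwise.
if (texts = gem_texts('rack'))
  puts texts[:summary]
  puts texts[:description]
end
```

From the shell this can be shortened to a `ruby -e` one-liner around `Gem::Specification.find_by_name`.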

Ruby on Rails with chargify

I want to integrate Chargify into my Rails app. I have a user object, and I want the user to be able to subscribe for one month and have a boolean column on the user object updated. I prefer to use the API, not hosted pages. How can I do that?
Is there any example for chargify on ruby on rails for handling subscriptions but with details about mvc for newbies?
Based on this thread and the Googles it looks like there is not a whole lot out there.
You could try looking at the Rails 2 example here and converting it or use the gem here (gem "chargify", "~> 0.3.0").
I know none of this is aimed at newbies, but the info seems too sparse.
This might get you going. It seems that Chargify itself is written in Rails, and therefore their API is Ruby code, which you can use...

Ruby: How to screen-scrape the result of an Ajax request

I have written a ruby script to screen scrape something using the 'open-uri' and 'hpricot' gems - everything works great so far.
But now I have to screen scrape something which is returned after a form is submitted via a javascript function (called by an 'onchange' event handler from a drop-down menu):
function submit_form() {
  document.list.action="/some/sort/of/path";
  document.list.submit();
}
AFAIK, open-uri lets you submit only GET requests. And if I'm not mistaken, a POST request would be needed here.
So my question is: what do I need to install and 'require', and what would the Ruby code to make that POST request look like? Sorry, I'm still pretty much a n00b...
Thank you very much for your help!
Tom
I think you definitely should use Mechanize. It provides a nifty interface to interact with remote pages, forms on them, and so forth (see this example).
The Ruby standard library has the Net::HTTP class, which supports POST requests:
require 'net/http'
# 'field' => 'value' is a placeholder; use the form's real field names
Net::HTTP.post_form(URI.parse('http://www.example.com/some/sort/of/path'), { 'field' => 'value' })
If you find the API there less than optimal, then take a look at the httparty gem.
Finally, while hpricot is a great gem, it isn't actively developed any longer. You should consider moving to nokogiri which practically replaces hpricot and improves upon it.
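For a bit more control than post_form gives, the POST can be built as a request object first and sent separately. This is a sketch with the standard library only; the `sort_by` field and the `submit` helper are made up for illustration, so substitute the names of the real form inputs on the page.

```ruby
require 'net/http'
require 'uri'

uri = URI.parse('http://www.example.com/some/sort/of/path')

# 'sort_by' => 'date' is a made-up field; use the names of the real form inputs.
request = Net::HTTP::Post.new(uri.request_uri)
request.set_form_data('sort_by' => 'date')

# Hypothetical helper: sends the request and returns the response body
# (needs network access).
def submit(uri, request)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request).body }
end

# html = submit(uri, request)   # then hand html to Hpricot or Nokogiri
```

Building the request separately makes it easy to inspect the encoded body or add headers (a cookie, for instance) before anything goes over the wire.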

What is Rack? Can I use it to build web apps with Ruby?

ruby newbie alert! (hey that rhymes :))
I have read the official definition but still come up empty handed. What exactly is it when they say middleware? Is the purpose using ruby with https?
The smallish tutorial at patnaik's blog makes things clearer, but how do I do something with it on localhost? I have Ruby 1.9.2 installed along with the rack gem and the Mongrel server.
Do I start mongrel first? How?
Just to add a simplistic explanation of Rack (as I feel that is missing):
Rack is basically a way in which a web app can communicate with a web server. The communication goes like this:
The web server tells the app about the environment - this contains mainly what the user sent in as his request - the url, the headers, whether it's a GET or a POST, etc.
The web app responds with three things:
the status code, which will be something like 200 when everything went OK and 400 or above when something went wrong.
the headers, which carry information web browsers can use, such as how long to keep the webpage in their cache.
the body, which is the actual webpage you see in the browser.
These two steps more or less define the whole process by which web apps work.
So a very simple Rack app could look like this:
class MyApp
  def call(environment) # this method has to be named call
    [200, # the status code
     {"Content-Type" => "text/plain", "Content-Length" => "11"}, # headers
     ["Hello world"]] # the body
  end
end
# presuming you have rack & webrick
# (Rack::Handler is the Rack 1.x/2.x API; Rack 3 moved the handlers to the rackup gem)
if $0 == __FILE__
  require 'rack'
  Rack::Handler::WEBrick.run MyApp.new
end
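Since the question also asks what "middleware" means, here is a small sketch of the idea that needs no gems at all: a middleware is just another object with a call method that wraps an app. `HelloApp`, `AddHeader`, and the `X-Demo` header are hypothetical names for illustration, and the environment hash is hand-built with only two keys, whereas a real server fills in many more.

```ruby
# A tiny app following the Rack calling convention.
class HelloApp
  def call(env)
    [200, { 'Content-Type' => 'text/plain' }, ['Hello world']]
  end
end

# Hypothetical middleware: wraps another app and adds one response header.
class AddHeader
  def initialize(app, name, value)
    @app = app
    @name = name
    @value = value
  end

  def call(env)
    # Pass the request down to the wrapped app, then adjust its response.
    status, headers, body = @app.call(env)
    [status, headers.merge(@name => @value), body]
  end
end

# A hand-built environment hash standing in for what a web server would pass.
env = { 'REQUEST_METHOD' => 'GET', 'PATH_INFO' => '/' }
stack = AddHeader.new(HelloApp.new, 'X-Demo', 'yes')
status, headers, body = stack.call(env)
```

Because the middleware and the app share the same interface, any number of middlewares can be stacked this way, and the web server only ever talks to the outermost one.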
You would do well to search for other questions & answers that make sense to you. Try "Getting Started with Rails" or "Ruby Web Development". A lot of different topics on this site have been devoted to this exact subject, so you might save yourself some trouble there...
Ignoring the specifics of your question for a minute, it seems like you want to learn Ruby and build web apps. Before you start delving into Rack or Mongrel or anything else, you should know that there are 2 well established frameworks that help build Ruby web applications. The first is Ruby on Rails, and the other is Sinatra. There are many others, but these are the most well documented on Stack Overflow and the internet in general.
Check out the following links for some background...
www.rubyonrails.org
SO: building-a-website-best-practice-and-architecture-with-ruby
www.railstutorial.org
SO: learning-ruby-on-rails
If you still have a burning desire to answer your question - "what is rack?", you should follow the same process, and end up at this Stack Overflow Answer:
What is Rack middleware?
Good luck!
Very nice answers indeed. For my two cents, I'll add this link: I have a lot of Rack documentation stashed there, although it is by no means everything I have.
http://myrackapps.herokuapp.com/

Resources