Is there a module like Perl's LWP for Ruby?

In Perl there is an LWP module:
The libwww-perl collection is a set of Perl modules which provides a simple and consistent application programming interface (API) to the World-Wide Web. The main focus of the library is to provide classes and functions that allow you to write WWW clients. The library also contains modules that are of more general use, and even classes that help you implement simple HTTP servers.
Is there a similar module (gem) for Ruby?
Update
Here is an example of a function I wrote that extracts URLs from a specific website:
use LWP::UserAgent;
use HTML::TreeBuilder 3;
use HTML::TokeParser;

sub get_gallery_urls {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent("$0/0.1 " . $ua->agent);
    $ua->agent("Mozilla/8.0");

    my $req = HTTP::Request->new('GET' => $url);
    $req->header('Accept' => 'text/html');

    # send request
    my $response_u = $ua->request($req);
    die "Error: ", $response_u->status_line unless $response_u->is_success;

    my $root = HTML::TreeBuilder->new;
    $root->parse($response_u->content);

    my @gu = $root->find_by_attribute("id", "thumbnails");
    my %urls = ();

    foreach my $g (@gu) {
        my @as = $g->find_by_tag_name('a');
        foreach my $a (@as) {
            my $u = $a->attr("href");
            if ($u =~ /^\//) {
                $urls{"http://example.com" . $u} = 1;
            }
        }
    }

    return %urls;
}

The closest match is probably httpclient, which aims to be the equivalent of LWP. However, depending on what you plan to do, there may be better options. If you plan to follow links, fill out forms, etc. in order to scrape web content, you can use Mechanize, which is similar to the Perl module of the same name. There are also more Ruby-specific gems, such as the excellent rest-client and HTTParty (my personal favorite). See the HTTP Clients category of the Ruby Toolbox for a larger list.
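For example, a minimal fetch with HTTParty might look like this (example.com is just a placeholder URL):

require 'httparty'

response = HTTParty.get('http://example.com/')
puts response.code                      # HTTP status, e.g. 200
puts response.headers['content-type']   # headers behave like a hash
puts response.body[0, 200]              # first chunk of the returned HTML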
Update: Here's an example of how to find all links on a page in Mechanize (Ruby, but it would be similar in Perl):
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com/')

page.links.each do |link|
  puts link.text
end
P.S. As an ex-Perler myself, I used to worry about abandoning the excellent CPAN: would I paint myself into a corner with Ruby? Would I be unable to find an equivalent to a module I rely on? This has turned out not to be a problem at all; in fact, lately it has been quite the opposite: Ruby (along with Python) tends to be the first to get client support for new platforms, web services, etc.

Here's what your function might look like in Ruby.
require 'rubygems'
require 'mechanize'

def get_gallery_urls(url)
  ua = Mechanize.new
  ua.user_agent = "Mozilla/8.0"

  urls = {}

  doc = ua.get(url)
  doc.search("#thumbnails a").each do |a|
    u = a["href"]
    urls["http://example.com#{u}"] = 1 if u =~ /^\//
  end

  urls
end
Much nicer :)

I used Perl for years and years, and liked LWP. It was a great tool. However, here's how I'd go about extracting URLs from a page. This isn't spidering a site, but that'd be an easy thing:
require 'open-uri'
require 'uri'
urls = URI.extract(open('http://example.com').read)
puts urls
With the resulting output looking like:
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
http://www.w3.org/1999/xhtml
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback
Writing that as a method:
require 'open-uri'
require 'uri'
def get_gallery_urls(url)
  URI.extract(open(url).read)
end
or, closer to the original function while doing it the Ruby-way:
def get_gallery_urls(url)
  URI.extract(open(url).read).map { |u|
    URI.parse(u).host ? u : URI.join(url, u).to_s
  }
end
or, following closer to the original code:
require 'nokogiri'
require 'open-uri'
require 'uri'
def get_gallery_urls(url)
  Nokogiri::HTML(open(url))
    .at('#thumbnails')
    .search('a')
    .map { |link|
      href = link['href']
      URI.parse(href).host ? href : URI.join(url, href).to_s
    }
end
One of the things that attracted me to Ruby is its ability to be readable, while still being concise.
If you want to roll your own TCP/IP-based functions, Ruby's standard Net library is the starting point. By default you get:
net/ftp
net/http
net/imap
net/pop
net/smtp
net/telnet
with ssh, scp, sftp and others available as gems. Use gem search net -r | grep ^net- to see a short list.
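As a rough sketch of the lowest-level option, here's a plain net/http fetch (example.com is just a placeholder):

require 'net/http'
require 'uri'

uri = URI('http://example.com/')
response = Net::HTTP.get_response(uri)   # returns a Net::HTTPResponse
puts response.code                       # status code as a string, e.g. "200"
puts response.body[0, 200]               # beginning of the returned HTML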

This is more of an answer for anyone looking at this question who needs to know about easier/better/different alternatives for general web scraping in Perl, compared to using LWP (or even WWW::Mechanize).
Here is a quick selection of web scraping modules on CPAN:
Mojo::UserAgent
pQuery
Scrappy
Web::Magic
Web::Scraper
Web::Query
NB. The above are just in alphabetical order, so please choose your favourite poison :)
For most of my recent web scraping I've been using pQuery. You can see there are quite a few examples of usage on SO.
Below is your get_gallery_urls example using pQuery:
use strict;
use warnings;
use pQuery;

sub get_gallery_urls {
    my $url = shift;
    my %urls;

    pQuery($url)
        ->find("#thumbnails a")
        ->each( sub {
            my $u = $_->getAttribute('href');
            $urls{'http://example.com' . $u} = 1 if $u =~ /^\//;
        });

    return %urls;
}
PS. As Daxim has said in the comments, there are plenty of excellent Perl tools for web scraping. The hardest part is just choosing which one to use!

Related

Google Maps API accessible with Java, Python or Ruby?

Does anyone know a way to call the Google Maps APIs from Ruby, for example?
With a key, you can access the APIs through simple HTTPS requests, which you can send using open-uri and parse using json.
require 'open-uri'
require 'ostruct'
require 'json'

def journey_between(start, destination)
  key = "[Visit https://developers.google.com/maps/web/ to get a free key]"
  url = "https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins=#{start}&destinations=#{destination}&key=#{key}"
  json_response = open(url).read
  journey_data = JSON.parse(json_response, object_class: OpenStruct).rows[0].elements[0]
  return journey_data
end

journey = journey_between "London", "Glasgow"

puts journey.distance.text
#=> "412 mi"
puts journey.duration.text
#=> "6 hours 46 mins"
Unfortunately, you can't try this example without an API key. You can get one at https://developers.google.com/maps/web/ for free by registering a project under your Google account.

Web Scraping with Nokogiri and Mechanize

I am parsing prada.com and would like to scrape data in the div class "nextItem" and get its name and price. Here is my code:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
html_doc = Nokogiri::HTML(page)
page = html_doc.xpath("//ol[@class='nextItem']")
page.each do {|i| fp.write(i.text + "\n")}
end
I get an error and no output. What I think I am doing is instantiating a mechanize object and calling it agent.
Then creating a page variable and assigning it the url provided.
Then creating a variable that is a Nokogiri object with the Mechanize page passed in.
Then searching the page for all class references titled nextItem.
Then printing all the data contained there.
Can someone show me where I might have gone wrong?
Since Prada's website dynamically loads its content via JavaScript, it will be hard to scrape its content. See "Scraping dynamic content in a website" for more information.
Generally speaking, with Mechanize, after you get a page:
page = agent.get(page_url)
you can easily search items with CSS selectors and scrape for data:
next_items = page.search(".fooClass")
next_items.each do |item|
  price = item.search(".fooPrice").text
end
Then simply handle the strings or generate hashes as you desire.
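For instance, a rough sketch that gathers each item into a hash (the .fooClass, .fooName and .fooPrice selectors are made up; substitute the real ones for your page):

items = page.search(".fooClass").map do |item|
  {
    name:  item.search(".fooName").text.strip,    # hypothetical selector
    price: item.search(".fooPrice").text.strip    # hypothetical selector
  }
end
# items is now an array of hashes like { name: "...", price: "..." }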
Here are the wrong parts:
Check the block syntax again: use {} or do/end, but not both at the same time.
Mechanize#get returns a Mechanize::Page, which acts like a Nokogiri document; at the very least it has search, xpath and css. Use those instead of trying to coerce the document into a Nokogiri::HTML object.
There is no need to require 'open-uri' and 'nokogiri' when you are not using them directly.
Finally, consider brushing up on Ruby basics before continuing with web scraping.
Here is the code with fixes:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
page = page.search("//ol[@class='nextItem']").each do |i|
  fp.write(i.text + "\n")
end
fp.close

ruby and curl: skipping invalid pages

I am building a script to parse multiple page titles. Thanks to another question on Stack Overflow, I now have this working bit:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)[1]
puts simian
but if you try the same thing where a page has no title, for example:
curl = %x(curl http://zales.1.ai)
it dies with "undefined method for nil:NilClass", as there is no title ....
I can't check whether curl is nil, as it isn't in this case (it contains other lines).
Do you have any solution that keeps this working even when the title is not present, and moves on to the next page to check? I would appreciate it if we stick to this code, as I did try other solutions with nokogiri and uri (Nokogiri::HTML(open("http:/.....")), but those are not working either: subdomains like byname_meee.1.ai do not work with the default open-uri. So I would be thankful if we can stick to this code that uses curl.
UPDATE
I realize that I probably left out some specific cases that ought to be clarified. This is for parsing 300-400 pages. In the first run I noticed at least a few cases where nokogiri, hpricot and even the more basic open-uri do not work:
1) open-uri simply fails on a simple domain with an underscore, like http://levant_alejandro.1.ai. This is a valid domain and works with curl, but not with open-uri or with nokogiri using open-uri.
2) The second case is a page with no title, like http://zales.1.ai.
3) The third is a page with an image and no valid HTML, like http://voldemortas.1.ai/.
A fourth case would be a page that has nothing but an internal server error or a Passenger/Rack error.
The first three cases can be sorted out with this solution (thanks to Havenwood in the #ruby IRC channel):
curl = %x(curl http://voldemortas.1.ai/)

begin
  simian = curl.match(/<title>(.*)<\/title>/)[1]
rescue NoMethodError
  simian = "" # curl was nil?
rescue ArgumentError
  simian = "" # not html?
end

puts simian
Now, I am aware that this is neither elegant nor optimal.
REPHRASED QUESTION
Do you have a better way to achieve the same thing with nokogiri or another gem that handles these cases (no title, no valid HTML page, or even a 404 page)? Given that the pages I am parsing have a fairly simple title structure, is the above solution suitable? For the sake of knowledge, it would be useful to know why using an extra gem for the parsing, like nokogiri, would be the better option (note: I try to have few gem dependencies, as often and over time they tend to break).
You're making it much too hard on yourself.
Nokogiri doesn't care where you get the HTML; it just wants the body of the document. You can use Curb, Open-URI, or a raw Net::HTTP connection, and it'll parse the content returned.
Try Curb:
require 'curb'
require 'nokogiri'
doc = Nokogiri::HTML(Curl.get('http://odin.1.ai').body_str)
doc.at('title').text
=> "Welcome to Dotgeek.org * 1.ai"
If you don't know whether you'll have a <title> tag, then don't try to do it all at once:
title = doc.at('title')
next if (!title)
puts title.text
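Open-URI or a raw Net::HTTP call works just as well as Curb here; a rough equivalent using only the standard library (same URL assumed) would be:

require 'net/http'
require 'uri'
require 'nokogiri'

body  = Net::HTTP.get(URI('http://odin.1.ai'))   # raw HTML as a String
doc   = Nokogiri::HTML(body)
title = doc.at('title')
puts title.text if title                         # nil-safe, as above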
Take a look at "equivalent of curl for Ruby?" for more ideas.
You just need to check for the match before accessing it. If curl.match is nil, then you can't access the grouping:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)
simian &&= simian[1] # only access the matched group if available
puts simian
Do heed the Tin Man's advice and use Nokogiri. Your regexp is really only suitable for a brittle solution; it fails when the title element is spread over multiple lines.
Update
If you really don't want to use an HTML parser, and if you promise this is for a quick script, you can use OpenURI (a wrapper around net/http) from the standard library. It's at least a little cleaner than parsing curl output.
require 'open-uri'
def extract_title_content(line)
  title = line.match(%r{<title>(.*)</title>})
  title &&= title[1]
end

def extract_title_from(uri)
  title = nil

  open(uri) do |page|
    page.lines.each do |line|
      return title if title = extract_title_content(line)
    end
  end
rescue OpenURI::HTTPError => e
  STDERR.puts "ERROR: Could not download #{uri} (#{e})"
end
puts extract_title_from 'http://odin.1.ai'
What you're really looking for, it seems, is a way to skip non-HTML responses. That's much easier with a curl wrapper like curb, like the Tin Man suggested, than dropping to the shell and using curl there:
1.9.3p125 :001 > require 'curb'
=> true
1.9.3p125 :002 > response = Curl.get('http://odin.1.ai')
=> #<Curl::Easy http://odin.1.ai?>
1.9.3p125 :003 > response.content_type
=> "text/html"
1.9.3p125 :004 > response = Curl.get('http://voldemortas.1.ai')
=> #<Curl::Easy http://voldemortas.1.ai?>
1.9.3p125 :005 > response.content_type
=> "image/png"
1.9.3p125 :006 >
So your code could look something like this:
response = Curl.get(url)

if response.content_type == "text/html" # or more fuzzy: =~ /text/
  match = response.body_str.match(/<title>(.*)<\/title>/)
  title = match && match[1]
  # or use Nokogiri for heavier lifting
end
No more exceptions.

How to visit a URL with Ruby via http and read the output?

So far I have been able to stitch this together :)
begin
  open("http://www.somemain.com/" + path + "/" + blah)
rescue OpenURI::HTTPError
  @failure += painting.permalink
else
  @success += painting.permalink
end
But how do I read the output of the service that I would be calling?
Open-URI extends open, so you'll get a type of IO stream returned:
open('http://www.example.com') #=> #<StringIO:0x00000100977420>
You have to read that to get content:
open('http://www.example.com').read[0 .. 10] #=> "<!DOCTYPE h"
A lot of times a method will let you pass different types as a parameter. They check to see what it is and either use the contents directly, in the case of a string, or read the handle if it's a stream.
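A rough sketch of that pattern (the fetch_html method name is made up for illustration):

require 'open-uri'

def fetch_html(source)
  # Accept either a String of HTML or anything that responds to #read (an IO handle)
  source.respond_to?(:read) ? source.read : source
end

fetch_html("<html><body>hi</body></html>")   # string used directly
fetch_html(open('http://www.example.com'))   # handle is read first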
For HTML and XML, such as RSS feeds, we'll typically pass the handle to a parser and let it grab the content, parse it, and return an object suitable for searching further:
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.example.com'))
doc.class #=> Nokogiri::HTML::Document
doc.to_html[0 .. 10] #=> "<!DOCTYPE h"
doc.at('h1').text #=> "Example Domains"
doc = open("http://etc..")
content = doc.read
More often, people want to be able to parse the returned document; for this, use something like Hpricot or Nokogiri.
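A minimal sketch with Nokogiri (www.example.com is just a placeholder):

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open('http://www.example.com'))
puts doc.at('title').text   # prints the page's <title> text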
I'm not sure whether you want to do this yourself for the hell of it or not, but if you don't, Mechanize is a really nice gem for doing this.
It will visit the page you want and automatically wrap the page with Nokogiri, so that you can access its elements with CSS selectors such as "div#header h1". Ryan Bates has a video tutorial on it which will teach you everything you need to know to use it.
Basically you can just
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
agent.get("http://www.google.com")
agent.page.at("some css selector").text
It's that simple.

Ruby Regex Help

I want to extract the links to members' home sites from a site.
The links look like this:
<a href="http://www.ptop.se" target="_blank">
I tested it with this site:
http://www.rubular.com/
<a href="(.*?)" target="_blank">
It should output http://www.ptop.se.
Here is the code:
require 'open-uri'
url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
open(url) { |page| content = page.read()
links = content.scan(/<a href="(.*?)" target="_blank">/)
links.each {|link| puts #{link}
}
}
If you run this, it doesn't work. Why not?
I would suggest that you use one of the good Ruby HTML/XML parsing libraries, e.g. Hpricot or Nokogiri.
If you need to log in on the site, you might be interested in a library like WWW::Mechanize.
Code example:
require "open-uri"
require "hpricot"
require "nokogiri"
url = "http://itproffs.se/forumv2"
# Using Hpricot
doc = Hpricot(open(url))
doc.search("//a[@target='_blank']").each { |user| puts "found #{user.inner_html}" }
# Using Nokogiri
doc = Nokogiri::HTML(open(url))
doc.xpath("//a[@target='_blank']").each { |user| puts "found #{user.text}" }
Several issues with your code:
I don't know what you mean by using {link}. But if you want to append a '#' character to the link, make sure you wrap it in quotes, i.e. "#{link}".
String#scan accepts a block. Use it to loop through the matches.
The page you are trying to access does not return any links that the regex would match anyway.
Here's something that would work:
require 'open-uri'

url = "http://itproffs.se/forumv2/"

open(url) do |page|
  content = page.read()
  content.scan(/<a href="(.*?)" target="_blank">/) do |match|
    match.each { |link| puts link }
  end
end
There are better ways to do it, I'm sure, but this should work. Hope it helps.
