Google Translate API call in Ruby generates "URL too Large" error - ruby

The following Ruby code accesses Google Translate's API:
require 'google/api_client'
client = Google::APIClient.new
translate = client.discovered_api('translate','v2')
client.authorization.access_token = '123' # dummy
client.key = "<My Google API Key>"
response = client.execute(
:api_method => translate.translations.list,
:parameters => {
'format' => 'text',
'source' => 'eng',
'target' => 'fre',
'q' => 'This is the text to be translated'
}
)
The response comes back in JSON format.
This code works for text-to-be-translated (the q argument) of less than approximately 750 characters, but generates the following error for longer text:
Error 414 (Request-URI Too Large)!!1
I have Googled this problem and found the following pages:
https://developers.google.com/translate/
https://github.com/google/google-api-ruby-client
It seems that the Google API Client code is placing the call to the Google API
using a GET instead of a POST. As various systems place limits on the length of URLs,
this restricts the length of text that can be translated.
The form of the solution to this problem seems to be that I should instruct
the Google API interface to place the call to the Google API using a POST
instead of a GET. However, I so far haven't been able to figure out how to do this.
I see that there are a few other StackOverflow questions relating to this
issue in other languages. However, I so far haven't been able to figure out how
to apply it to my Ruby code.
Thanks very much Dave Sag. Your approach worked, and I have developed it
into a working example that other Stack Overflow users might find useful.
(Note: Stack Overflow won't let me post this as an answer for eight hours, so I'm
editing my question instead.) Here is the example:
#!/usr/local/rvm/rubies/ruby-1.9.3-p194/bin/ruby
#
# This is a self-contained Unix/Linux Ruby script.
#
# Replace the first line above with the location of your ruby executable. e.g.
#!/usr/bin/ruby
#
################################################################################
#
# Author : Ross Williams.
# Date : 1 January 2014.
# Version : 1.
# Licence : Public Domain. No warranty or liability.
#
# WARNING: This code is example code designed only to help Stack Overflow
# users solve a specific problem. It works, but it does not contain
# comprehensive error checking which would probably double the length
# of the code.
require 'uri'
require 'net/http'
require 'net/https'
require 'rubygems'
require 'json'
################################################################################
#
# This method translates a string of up to 5K from one language to another using
# the Google Translate API.
#
# google_api_key
# This is a string of about 39 characters which identifies you
# as a user of the Google Translate API. At the time of writing,
# this API costs US$20/MB of source text. You can obtain a key
# by registering at:
# https://developers.google.com/translate/
# https://developers.google.com/translate/v2/getting_started
#
# source_language_code
# target_language_code
# These arguments identify the source and target language.
# Each of these arguments should be a string containing one
# of Google's two-letter language identification codes.
# A list of these codes can be found at:
# https://sites.google.com/site/opti365/translate_codes
#
# text_to_be_translated
# This is a string to be translated. Ruby provides excellent
# Unicode support and this string can contain non-ASCII characters.
# This string must not be longer than 5K characters.
#
def google_translate(google_api_key,source_language_code,target_language_code,text_to_be_translated)
# Note: The 5K limit on text_to_be_translated is stated at:
# https://developers.google.com/translate/v2/using_rest?hl=ja
# It is not clear to me whether this is 5*1024 or 5*1000.
# The Google Translate API is served at this HTTPS address.
# See http://stackoverflow.com/questions/13152264/sending-http-post-request-in-ruby-by-nethttp
uri = URI.parse("https://www.googleapis.com/language/translate/v2")
request = Net::HTTP::Post.new(uri.path)
# The API accepts only the GET method. However, the GET method has URL length
# limitations so we have to use POST method instead. We need to use a trick for
# delivering the form fields using POST but having the API perceive it as a GET.
# Setting a key/value pair in request adds a field to the HTTP header to do this.
request['X-HTTP-Method-Override'] = 'GET'
# Load the arguments to the API into the POST form fields.
params = {
'access_token' => '123', # Dummy parameter seems to be required.
'key' => google_api_key,
'format' => 'text',
'source' => source_language_code,
'target' => target_language_code,
'q' => text_to_be_translated
}
request.set_form_data(params)
# Execute the request.
# It's important to use HTTPS, as the API key is being transmitted.
https = Net::HTTP.new(uri.host,uri.port)
https.use_ssl = true
response = https.request(request)
# The API returns a record in JSON format.
# See http://en.wikipedia.org/wiki/JSON
# See http://www.json.org/
json_text = response.body
# Parse the JSON record, yielding a nested hash/list data structure.
json_structure = JSON.parse(json_text)
# Navigate down into the data structure to get the result.
# This navigation was coded by examining the JSON text from an actual run.
data_hash = json_structure['data']
translations_list = data_hash['translations']
translation_hash = translations_list[0]
translated_text = translation_hash['translatedText']
return translated_text
end # def google_translate
################################################################################
google_api_key = '<INSERT YOUR GOOGLE TRANSLATE API KEY HERE>'
source_language_code = 'en' # English
target_language_code = 'fr' # French
# To test the code, I have chosen a sample text of about 3393 characters.
# This is large enough to exceed the GET URL length limit, but small enough
# not to exceed the Google Translate API length limit of 5K.
# Sample text is from http://www.gutenberg.org/cache/epub/2701/pg2701.txt
text_to_be_translated = <<END
Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
There now is your insular city of the Manhattoes, belted round by
wharves as Indian isles by coral reefs--commerce surrounds it with
her surf. Right and left, the streets take you waterward. Its extreme
downtown is the battery, where that noble mole is washed by waves, and
cooled by breezes, which a few hours previous were out of sight of land.
Look at the crowds of water-gazers there.
Circumambulate the city of a dreamy Sabbath afternoon. Go from Corlears
Hook to Coenties Slip, and from thence, by Whitehall, northward. What
do you see?--Posted like silent sentinels all around the town, stand
thousands upon thousands of mortal men fixed in ocean reveries. Some
leaning against the spiles; some seated upon the pier-heads; some
looking over the bulwarks of ships from China; some high aloft in the
rigging, as if striving to get a still better seaward peep. But these
are all landsmen; of week days pent up in lath and plaster--tied to
counters, nailed to benches, clinched to desks. How then is this? Are
the green fields gone? What do they here?
But look! here come more crowds, pacing straight for the water, and
seemingly bound for a dive. Strange! Nothing will content them but the
extremest limit of the land; loitering under the shady lee of yonder
warehouses will not suffice. No. They must get just as nigh the water
as they possibly can without falling in. And there they stand--miles of
them--leagues. Inlanders all, they come from lanes and alleys, streets
and avenues--north, east, south, and west. Yet here they all unite.
Tell me, does the magnetic virtue of the needles of the compasses of all
those ships attract them thither?
Once more. Say you are in the country; in some high land of lakes. Take
almost any path you please, and ten to one it carries you down in a
dale, and leaves you there by a pool in the stream. There is magic
in it. Let the most absent-minded of men be plunged in his deepest
reveries--stand that man on his legs, set his feet a-going, and he will
infallibly lead you to water, if water there be in all that region.
Should you ever be athirst in the great American desert, try this
experiment, if your caravan happen to be supplied with a metaphysical
professor. Yes, as every one knows, meditation and water are wedded for
ever.
END
translated_text =
google_translate(google_api_key,
source_language_code,
target_language_code,
text_to_be_translated)
puts(translated_text)
################################################################################
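For texts longer than the 5K limit mentioned in the script's comments, one option is to split the input into pieces and translate each piece separately. The following is only a sketch: the helper name split_into_chunks and the 4500-character default (chosen to leave some headroom under 5K) are my own assumptions, not part of the Google API, and the helper collapses runs of whitespace into single spaces.

```ruby
# Hypothetical helper: split text into pieces of at most max_chars characters,
# breaking at whitespace so that words stay intact where possible.
def split_into_chunks(text, max_chars = 4500)
  chunks = []
  current = +''
  text.split(/\s+/).each do |word|
    # A single word longer than the limit is cut into fixed-size slices.
    pieces = word.length > max_chars ? word.scan(/.{1,#{max_chars}}/m) : [word]
    pieces.each do |piece|
      if current.empty?
        current << piece
      elsif current.length + 1 + piece.length <= max_chars
        current << ' ' << piece
      else
        chunks << current
        current = +piece
      end
    end
  end
  chunks << current unless current.empty?
  chunks
end

p split_into_chunks('aaa bbb ccc ddd', 7)
```

Each chunk could then be passed to google_translate in turn and the results joined with a space. Splitting at sentence boundaries would probably give better translations than splitting at arbitrary word boundaries.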

I'd use the REST API directly rather than the whole Google API client, which seems to encompass everything; yet when I searched its source for 'translate', it turned up nothing.
Create a request and set the appropriate header as per these docs:
Note: You can also use POST to invoke the API if you want to send more data in a single request. The q parameter in the POST body must be less than 5K characters. To use POST, you must use the X-HTTP-Method-Override header to tell the Translate API to treat the request as a GET (use X-HTTP-Method-Override: GET).
Code (not tested) might look like:
require 'uri'
require 'json'
require 'net/http'
require 'net/https'
uri = URI.parse('https://www.googleapis.com/language/translate/v2')
request = Net::HTTP::Post.new(uri, {'X-HTTP-Method-Override' => 'GET'})
params = {
'key' => '<My Google API Key>',
'format' => 'text',
'source' => 'en', # two-letter language codes
'target' => 'fr',
'q' => 'This is the text to be translated'
}
request.set_form_data(params)
https = Net::HTTP.new(uri.host, uri.port)
https.use_ssl = true
response = https.request request
parsed = JSON.parse(response.body) # parse the JSON body, not the response object

Related

How to avoid getting blocked by websites when using Ruby Mechanize for web crawling

I am successful scraping building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited according to their robots.txt.
The code for a single run is as follows:
require 'mechanize'
class PropShark
def initialize(key,link_key)
@@key = key
@@link_key = link_key
end
def crawl_propshark_single
agent = Mechanize.new{ |agent|
agent.user_agent_alias = 'Mac Safari'
}
agent.ignore_bad_chunking = true
agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
page = agent.get('https://www.google.com/')
form = page.forms.first
form['q'] = "#{@@key}"
page = agent.submit(form)
page = form.submit
page.links.each do |link|
if link.text.include?("#{@@link_key}")
if link.text.include?("PropertyShark")
property_page = link.click
else
next
end
if property_page
data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
data_name = property_page.css("div.cols").css("th")[4].text
#result_hash["#{data_name}"] = data_value
else
next
end
end
end
return #result_hash
end
end #endof: class PropShark
# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key,key_link)
puts spider.crawl_propshark_single
I get the following error, but after an hour or two it disappears:
undefined method `text' for nil:NilClass (NoMethodError)
When I loop over several addresses using the code above, I delay the process with sleep 80 between addresses.
The first thing you should do, before you do anything else, is to contact the website owner(s). Right now, your actions could be interpreted as anywhere between overly aggressive and illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for this particular thing. Either way, if you are going to be depending on this website for your product, you may want to consider playing nice with them.
With that being said, you are moving through their website with all of the grace of an elephant in a china shop. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural human-emulation delay. Also, you should either disguise your user agent, or make it super obvious (Josh's Big Bad Scraper). You may even consider using something like Selenium, which uses a real browser, instead of Mechanize, to give away fewer hints.
You may also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page, but some random error page. A simple retry may be all you need to get that data in question. When scraping, a poorly-functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
If none of that works, you could consider setting up elaborate arrays of proxies, but at that point you would be much better off using one of the many online web-scraping, API-creation, or data-extraction services that currently exist. They are fairly inexpensive and already do everything discussed above, plus more.
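The retry and less-predictable delay suggested above could be sketched roughly as follows. The helper name with_retries and its parameters are hypothetical, and the simulated failure stands in for a real page fetch; this is an illustration under those assumptions, not a tested scraper component.

```ruby
# Hypothetical helper: run the block, retrying on the listed error classes
# with a growing, randomized pause between attempts.
def with_retries(max_attempts: 3, base_delay: 2.0, on: [StandardError],
                 sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *on
    raise if attempt >= max_attempts
    # Exponential backoff plus random jitter, so requests are not evenly spaced.
    sleeper.call(base_delay * (2**(attempt - 1)) + rand * base_delay)
    retry
  end
end

# Simulated fetch: the first attempt fails, the second succeeds.
page = with_retries(max_attempts: 4, base_delay: 0.1) do |attempt|
  raise IOError, 'simulated error page' if attempt < 2
  '<html>ok</html>'
end
puts page
```

In a real scraper the block would wrap the agent.get call and the parsing that follows it, so that an unexpected error page triggers another attempt instead of a NoMethodError.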
It is very likely nothing is "blocking" you. As you pointed out
property_page.css("div.cols").css("td.r_align")[4].text
is the problem. So let's focus on that line of code for a second.
Say the first time round your columns are columns = [1,2,3,4,5]; then columns[4] will return 5 (the element at index 4).
Now, for fun, let's assume the next go-around your columns are columns = ['a','b','c','d']; then columns[4] will return nil because there is nothing at index 4.
This appears to be your case: sometimes there are 5 columns and sometimes there are not, which leads to nil.text and the error you are receiving.
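A small sketch of guarding against the missing element. A Struct stands in for the Nokogiri nodes here, and cell_text is a hypothetical helper name; the safe-navigation operator (&.) needs Ruby 2.3 or later.

```ruby
# Stand-in for a parsed table cell; real code would get Nokogiri nodes.
cell_class = Struct.new(:text)

# Return the text of the cell at index, or nil when the row is too short.
def cell_text(cells, index)
  cells[index]&.text
end

five_cells = %w[a b c d e].map { |t| cell_class.new(t) }
four_cells = %w[a b c d].map { |t| cell_class.new(t) }

p cell_text(five_cells, 4) # "e"
p cell_text(four_cells, 4) # nil, instead of raising NoMethodError
```

The caller can then skip or log the address when the result is nil, rather than crashing partway through the loop.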

DRY search every page of a site with nokogiri

I want to search every page of a site. My thought is to find all links on a page that stay within the domain, visit them, and repeat. I'll have to implement measures to not repeat efforts as well.
So it starts very easily:
page = 'http://example.com'
nf = Nokogiri::HTML(open(page))
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
"main_links" is now an array of links from the active page that start with "/" (which should be links on the current domain only).
From here I can feed and read those links into similar code above, but I don't know the best way to ensure I don't repeat myself. I'm thinking I start collecting all the visited links as I visit them:
main_links.each do |ml|
visited_links = [] #new array of what is visted
np = Nokogiri::HTML(open(page + ml)) #load the first main_link
visted_links.push(ml) #push the page we're on
np_links = np.xpath('//a').map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq #grab all links on this page pointing to the current domain
main_links.push(np_links).compact.uniq #remove duplicates after pushing?
end
I'm still working out this last bit... but does this seem like the proper approach?
Thanks.
Others have advised you not to write your own web crawler. I agree with this if performance and robustness are your goals. However, it can be a great learning exercise. You wrote this:
"[…] but I don't know the best way to ensure I don't repeat myself"
Recursion is the key here. Something like the following code:
require 'set'
require 'uri'
require 'nokogiri'
require 'open-uri'
def crawl_site( starting_at, &each_page )
files = %w[png jpeg jpg gif svg txt js css zip gz]
starting_uri = URI.parse(starting_at)
seen_pages = Set.new # Keep track of what we've seen
crawl_page = ->(page_uri) do # A re-usable mini-function
unless seen_pages.include?(page_uri)
seen_pages << page_uri # Record that we've seen this
begin
doc = Nokogiri.HTML(URI.open(page_uri)) # Get the page (URI.open is required on Ruby 3.0+)
each_page.call(doc,page_uri) # Yield page and URI to the block
# Find all the links on the page
hrefs = doc.css('a[href]').map{ |a| a['href'] }
# Make these URIs, throwing out problem ones like mailto:
uris = hrefs.map{ |href| URI.join( page_uri, href ) rescue nil }.compact
# Pare it down to only those pages that are on the same site
uris.select!{ |uri| uri.host == starting_uri.host }
# Throw out links to files (this could be more efficient with regex)
uris.reject!{ |uri| files.any?{ |ext| uri.path.end_with?(".#{ext}") } }
# Remove #foo fragments so that sub-page links aren't differentiated
uris.each{ |uri| uri.fragment = nil }
# Recursively crawl the child URIs
uris.each{ |uri| crawl_page.call(uri) }
rescue OpenURI::HTTPError # Guard against 404s
warn "Skipping invalid link #{page_uri}"
end
end
end
crawl_page.call( starting_uri ) # Kick it all off!
end
crawl_site('http://phrogz.net/') do |page,uri|
# page here is a Nokogiri HTML document
# uri is a URI instance with the address of the page
puts uri
end
In short:
Keep track of what pages you've seen using a Set. Do this not by href value, but by the full canonical URI.
Use URI.join to turn possibly-relative paths into the correct URI with respect to the current page.
Use recursion to keep crawling every link on every page, but bailing out if you've already seen the page.
You are missing some things.
A local reference can start with /, but it can also start with ., .. or even no special character, meaning the link is within the current directory.
JavaScript can also be used as a link, so you'll need to search throughout your document and find tags being used as buttons, then parse out the URL.
This:
links = nf.xpath '//a' #find all links on current page
main_links = links.map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
can be better written:
nf.search('a[href^="/"]').map{ |a| a['href'] }.uniq
In general, don't do this:
....map{|l| l['href'] if l['href'] =~ /^\//}.compact.uniq
because it is very awkward. The conditional in the map results in nil entries in the resulting array, so don't do that. Use select or reject to reduce the set of links that meet your criteria, and then use map to transform them. In your use here, pre-filtering using ^= in the CSS makes it even easier.
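To illustrate the select-then-map pattern without needing a parsed document, here is a sketch where plain hashes stand in for the link nodes (the sample hrefs are made up):

```ruby
# Hashes stand in for Nokogiri link nodes; both respond to ['href'].
links = [
  { 'href' => '/about' },
  { 'href' => 'http://other.example/x' },
  { 'href' => nil },
  { 'href' => '/about' },
  { 'href' => '/contact' }
]

# Filter first, then transform: no nil entries are ever produced.
local_hrefs = links
  .select { |l| l['href'].to_s.start_with?('/') }
  .map { |l| l['href'] }
  .uniq

p local_hrefs # ["/about", "/contact"]
```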
Don't store the links in memory. You'll lose all progress if you crash or stop your code. Instead, at a minimum, use something like a SQLite database on disk as a data-store. Create a "href" field that is unique to avoid repeatedly hitting the same page.
Use Ruby's built-in URI class, or the Addressable gem, to parse and manipulate URLs. They save you work, and will do things the right way when you start encoding/decoding queries and trying to normalize the parameters to check for uniqueness, extracting and manipulating paths, etc.
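As a sketch of what the built-in URI class does for you, the hypothetical helper below resolves an href against the page it was found on and drops the fragment. The helper name and the example URLs are assumptions for illustration.

```ruby
require 'uri'

# Hypothetical helper: turn an href found on page_url into an absolute URI
# without its #fragment, or nil if the href cannot be parsed.
def absolute_uri(page_url, href)
  uri = URI.join(page_url, href)
  uri.fragment = nil
  uri
rescue URI::InvalidURIError
  nil
end

base = 'http://example.com/docs/guide/index.html'
puts absolute_uri(base, '/top')            # http://example.com/top
puts absolute_uri(base, 'next.html')       # http://example.com/docs/guide/next.html
puts absolute_uri(base, './next.html#s2')  # http://example.com/docs/guide/next.html
puts absolute_uri(base, '../api/')         # http://example.com/docs/api/
```

This handles links starting with /, ., .., or no special character in one place, which covers the relative-reference cases mentioned earlier.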
Many sites use session IDs in the URL query to identify the visitor. That ID can make every link different if you start, then stop, then start again, or if you're not returning the cookies received from the site, so you have to return cookies, and figure out which query parameters are significant, and which are going to throw off your code. Keep the first and throw away the second when you store the links for later parsing.
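One way to throw away the insignificant parameters before storing a link is sketched below. The helper name canonical_url and the list of parameter names to ignore are hypothetical; which parameters are actually insignificant depends on the site, and sorting assumes parameter order does not matter there.

```ruby
require 'uri'

# Hypothetical helper: drop session-style query parameters and sort the rest,
# so that the same page always yields the same stored URL.
def canonical_url(url, ignored = %w[sid sessionid PHPSESSID])
  uri = URI.parse(url)
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |name, _| ignored.include?(name) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept.sort)
  end
  uri.to_s
end

puts canonical_url('http://example.com/t?sid=abc123&page=2&forum=7')
puts canonical_url('http://example.com/t?sid=zzz999&forum=7&page=2')
```

Both calls print the same URL, so a unique "href" column would treat them as one page.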
Use a HTTP client like Typhoeus with Hydra to retrieve multiple pages in parallel, and store them in your database, with a separate process that parses them and feeds the URLs to parse back into the database. This can make a huge difference in your overall processing time.
Honor the site's robots.txt file, and throttle your requests to avoid beating up their server. Nobody likes bandwidth hogs and consuming a significant amount of a site's bandwidth or CPU time without permission is a good way to get noticed then banned. Your site will go to zero throughput at that point.
It's a more complicated problem than you seem to realize. Using a library along with Nokogiri is probably the way to go. Unless you're using Windows (like me), you might want to look into Anemone.

Easy way to get the LON LAT of a list of cities

I have a list of cities in CSV format for which I need the longitude and latitude.
This is my CSV:
GeoSearchString,Title
"New York","Manhatten"
"Infinite Loop 1, San Francisco","Apple Headquarter"
Now I am looking for an easy way to get the coordinates for those places in JSON format.
unless ARGV[0]
puts 'Usage ruby Geocoder.rb cities.csv' unless ARGV[0]
exit
end
Can be rewritten as:
abort 'Usage ruby Geocoder.rb cities.csv' unless ARGV[0]
I'd replace:
csv = CSV.parse(File.read(file), :headers => true)
results = []
csv.each do |row|
with:
results = []
CSV.foreach(file, :headers => true) do |row| # foreach takes a file path, not file contents
Be VERY careful with:
search = row['GeoSearchString'] rescue continue
title = row['Title'] rescue ''
A single in-line rescue is a loaded, very large caliber, gun pointed at your foot. You have two. In this particular case it might be safe, without unintended side effects, but in general you want to go there very carefully.
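A short sketch of why the inline rescue is risky, and of an explicit alternative. The row hash below is made-up sample data standing in for a CSV row, and stripp is a deliberate typo.

```ruby
row = { 'GeoSearchString' => 'New York', 'Title' => nil }

# Inline rescue: the typo raises NoMethodError, which is silently swallowed,
# so the bug looks exactly like a legitimately empty value.
swallowed = row['GeoSearchString'].stripp rescue ''
p swallowed # ""

# Explicit handling: missing values are handled by design, and a typo
# anywhere in this code would still raise and be noticed.
search = row['GeoSearchString']
title  = row['Title'] || ''
if search.nil? || search.strip.empty?
  warn 'skipping row without a search string'
else
  puts "#{search} (title: #{title.inspect})"
end
```

Inside the CSV loop, the warn branch would become a next to skip the row.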
I came up with the following script (gist)
#!/usr/bin/env ruby
# encoding: utf-8
require 'geocoder'
require 'csv'
require 'json'
require 'yaml'
abort 'Usage ruby Geocoder.rb cities.csv' unless ARGV[0]
def main(file)
puts "Loading file #{file}"
results = []
CSV.foreach(file, :headers => true) do |row|
# Hacky way to skip the current search string if no result is found
search = row['GeoSearchString'] rescue continue
# The title is optional
title = row['Title'] rescue ''
geo = Geocoder.search(search).first
if geo
results << {search: search, title: title, lon: geo.longitude, lat: geo.latitude}
end
end
puts JSON.pretty_generate(results)
end
main ARGV[0]
If you want your own database of cities, states, and other interesting areas (let's call them places) you can get that for free from the US Geological Survey website. It's called their Topical Gazetteer and has a massive amount of "places" along with geocodes. You can get a full national file which is 80MB or just the Populated Places which is 8MB. Additionally, you can download just the states you are interested in.
These places include:
Populated Places – Named features with human habitation—cities, towns, villages, etc. Subset of National file above.
Historical Features – Features that no longer exist on the landscape or no longer serve the original purpose. Subset of National file above.
Concise Features – Large features that should be labeled on maps with a scale of 1:250,000. Subset of National file above.(last updated October 2, 2009)
All Names – All names, both official and nonofficial (variant), for all features in the nation.
Feature Description/History – Includes the following additional feature attributes: Description and History. This file is not a standard topical gazetteer file. If you need these additional feature attributes, you will need to associate the data, using the feature id column, with the data in one of our other files, such as those under the "States, Territories, Associated Areas of the United States" section.
Antarctica Features – Features in Antarctica approved for use by the US government.
Government Units – Official short names, alphabetic, and numeric codes of States
This is not going to be ZIP codes, but actual cities and towns. Data that comes from the USPS will have lat/lon coordinates based on the centroid of a ZIP code (or group of ZIP codes that, when averaged, represent a city). Because it is ZIP based, the lat/lon will be different than the data that comes from the USGS. They're not interested in ZIP codes. Also keep in mind that ZIP codes change monthly, whenever the USPS needs to revise their delivery routes. Actual city locations really don't change. (ignoring nominal tectonic movement). So a definitive center point lat/lon may be best derived from the USGS data instead of the USPS ZIP-based weighted centroid.

How to loop through all links in div and collect values from opened fields

Is it possible to open every link in a certain div and collect the values of the opened fields all together in one file, or at least as terminal output?
I am trying to get a list of coordinates for all markers visible on a Google map.
all_links = b.div(:id, "kmlfolders").links
all_links.each do |link|
b.link.click
b.link(:text, "Norādījumi").click
puts b.text_field(:title, "Galapunkta_adrese").value
end
Are there easier or more effective ways to automatically collect coordinates from all markers?
Unless there is other data already in the HTML that you could pick through (alt tags? elements invoked via onhover?), that does seem like the most practical way to iterate through the links. However, from what I can see you are not actually making use of the 'link' object inside your loop. You'd need something more like this, I think:
all_links = b.div(:id, "kmlfolders").links
all_links.each do |thelink|
b.link(:href => thelink.href).click
b.link(:text, "Norādījumi").click
puts b.text_field(:title, "Galapunkta_adrese").value
end
Using their API is probably a much more effective means of getting what you want, however. That's why folks make APIs, after all, and if one is available, using it is almost always best. Using a test tool as a screen-scraper to gather the info is liable to be a lot harder in the long run than learning how to make some API calls and get the data that way.
For web-based APIs and Ruby I find the rest-client gem works great; other folks like HTTParty.
As I'm not already familiar with the Google API, I find it hard to dig into it for one particular need. Therefore I wrote a short watir-webdriver script for collecting the coordinates of markers on a protected Google map. The resulting file is used by a Python script that creates speedcam files for navigation devices.
In this case it's a speedcam map maintained and updated by the Latvian police, but this script can probably be used with any Google map just by replacing the URL.
# encoding: utf-8
require "rubygems"
require "watir-webdriver"
@b = Watir::Browser.new :ff
#--------------------------------
@b.goto "http://maps.google.com/maps?source=s_q&f=q&hl=lv&geocode=&q=htt%2F%2Fmaps.google.com%2Fmaps%2Fms%3Fmsid%3D207561992958290099079.0004b731f1c645294488e%26msa%3D0%26output%3Dkml&aq=&sll=56.799934,24.5753&sspn=3.85093,8.64624&ie=UTF8&ll=56.799934,24.5753&spn=3.610137,9.887695&z=7&vpsrc=0&oi=map_misc&ct=api_logo"
@b.div(:id, "kmlfolders").wait_until_present
all_markers = @b.div(:id, "kmlfolders").divs(:class, "fdrlt")
@prev_coordinates = 1
puts "#{all_markers.length} speedcam markers detected"
File.open("list_of_coordinates.txt","w") do |outfile|
  all_markers.each do |marker|
    sleep 1
    marker.click
    sleep 1
    description = @b.div(:id => "iw_kml").text
    @b.span(:class, "actbar-text").click
    sleep 2
    coordinates = @b.text_field(:name, "daddr").value
    redo if coordinates == @prev_coordinates
    puts coordinates
    outfile.puts coordinates
    @prev_coordinates = coordinates
  end
end
puts "Coordinates saved in file!"
@b.close
Works on both Mac OS X 10.7 and Windows 7.

How to read someone else's forum

My friend has a forum, which is full of posts containing information. Sometimes she wants to review the posts in her forum, and come to conclusions. At the moment she reviews posts by clicking through her forum, and generates a not necessarily accurate picture of the data (in her brain) from which she makes conclusions. My thought today was that I could probably bang out a quick Ruby script that would parse the necessary HTML to give her a real idea of what the data is saying.
I am using Ruby's net/http library for the first time today, and I have encountered a problem. While my browser has no trouble viewing my friend's forum, it seems that the method Net::HTTP.new("forumname.net") produces the following error:
No connection could be made because the target machine actively refused it. - connect(2)
Googling that error, I have learned that it has to do with MySQL (or something like that) not wanting nosy guys like me remotely poking around in there: for security reasons. This makes sense to me, but it makes me wonder: how is it that my browser gets to poke around on my friend's forum, but my little Ruby script gets no poking rights. Is there some way for my script to tell the server that it is not a threat? That I only want reading rights and not writing rights?
Thanks guys,
z.
Scraping a web site? Use mechanize:
#!/usr/bin/ruby1.8
require 'rubygems'
require 'mechanize'
agent = Mechanize.new # WWW::Mechanize in Mechanize versions before 1.0
page = agent.get("http://xkcd.com")
page = page.link_with(:text=>'Forums').click
page = page.link_with(:text=>'Mathematics').click
page = page.link_with(:text=>'Math Books').click
#puts page.parser.to_html # If you want to see the html you just got
posts = page.parser.xpath("//div[@class='postbody']")
for post in posts
  title = post.at_xpath('h3//text()').to_s
  author = post.at_xpath("p[@class='author']//a//text()").to_s
  body = post.xpath("div[@class='content']//text()").collect do |div|
    div.to_s
  end.join("\n")
  puts '-' * 40
  puts "title: #{title}"
  puts "author: #{author}"
  puts "body:", body
end
The first part of the output:
----------------------------------------
title: Math Books
author: Cleverbeans
body:
This is now the official thread for questions about math books at any level, fr\
om high school through advanced college courses.
I'm looking for a good vector calculus text to brush up on what I've forgotten.\
We used Stewart's Multivariable Calculus as a baseline but I was unable to pur\
chase the text for financial reasons at the time. I figured some things may hav\
e changed in the last 12 years, so if anyone can suggest some good texts on thi\
s subject I'd appreciate it.
----------------------------------------
title: Re: Multivariable Calculus Text?
author: ThomasS
body:
The textbooks go up in price and new pretty pictures appear. However, Calculus \
really hasn't changed all that much.
If you don't mind a certain lack of pretty pictures, you might try something li\
ke Widder's Advanced Calculus from Dover. it is much easier to carry around tha\
n Stewart. It is also written in a style that a mathematician might consider no\
rmal. If you think that you might want to move on to real math at some point, i\
t might serve as an introduction to the associated style of writing.
Some sites can only be accessed with the "www" subdomain, so that may be causing the problem.
To create a GET request, you would want to use the Get class:
require 'net/http'
url = URI.parse('http://www.forum.site/')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http|
http.request(req)
}
puts res.body
You might also need to set the user agent at some point, passed as a header option:
req = Net::HTTP::Get.new(url.path, {'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'})
