How to read someone else's forum - Ruby

My friend has a forum, which is full of posts containing information. Sometimes she wants to review the posts in her forum, and come to conclusions. At the moment she reviews posts by clicking through her forum, and generates a not necessarily accurate picture of the data (in her brain) from which she makes conclusions. My thought today was that I could probably bang out a quick Ruby script that would parse the necessary HTML to give her a real idea of what the data is saying.
I am using Ruby's net/http library for the first time today, and I have encountered a problem. While my browser has no trouble viewing my friend's forum, it seems that the method Net::HTTP.new("forumname.net") produces the following error:
No connection could be made because the target machine actively refused it. - connect(2)
Googling that error, I have learned that it has to do with MySQL (or something like that) not wanting nosy guys like me remotely poking around in there, for security reasons. This makes sense to me, but it makes me wonder: how is it that my browser gets to poke around on my friend's forum, but my little Ruby script gets no poking rights? Is there some way for my script to tell the server that it is not a threat? That I only want reading rights and not writing rights?
Thanks guys,
z.

Scraping a web site? Use mechanize:
#!/usr/bin/ruby1.8
require 'rubygems'
require 'mechanize'

# Note: Mechanize 2.x dropped the WWW namespace; there it is just Mechanize.new.
agent = WWW::Mechanize.new
page = agent.get("http://xkcd.com")
page = page.link_with(:text => 'Forums').click
page = page.link_with(:text => 'Mathematics').click
page = page.link_with(:text => 'Math Books').click
#puts page.parser.to_html # If you want to see the html you just got
posts = page.parser.xpath("//div[@class='postbody']")
for post in posts
  title  = post.at_xpath('h3//text()').to_s
  author = post.at_xpath("p[@class='author']//a//text()").to_s
  body   = post.xpath("div[@class='content']//text()").collect do |div|
    div.to_s
  end.join("\n")
  puts '-' * 40
  puts "title: #{title}"
  puts "author: #{author}"
  puts "body:", body
end
The first part of the output:
----------------------------------------
title: Math Books
author: Cleverbeans
body:
This is now the official thread for questions about math books at any level, from high school through advanced college courses.
I'm looking for a good vector calculus text to brush up on what I've forgotten. We used Stewart's Multivariable Calculus as a baseline but I was unable to purchase the text for financial reasons at the time. I figured some things may have changed in the last 12 years, so if anyone can suggest some good texts on this subject I'd appreciate it.
----------------------------------------
title: Re: Multivariable Calculus Text?
author: ThomasS
body:
The textbooks go up in price and new pretty pictures appear. However, Calculus really hasn't changed all that much.
If you don't mind a certain lack of pretty pictures, you might try something like Widder's Advanced Calculus from Dover. It is much easier to carry around than Stewart. It is also written in a style that a mathematician might consider normal. If you think that you might want to move on to real math at some point, it might serve as an introduction to the associated style of writing.

Some sites can only be accessed with the "www" subdomain, so that may be what's causing the problem. To create a GET request, you would want to use Net::HTTP::Get:
require 'net/http'

url = URI.parse('http://www.forum.site/')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) { |http|
  http.request(req)
}
puts res.body
You might also need to set the user agent at some point, passed as a header option:
{'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'}
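For instance, here is a minimal sketch of passing that header when building the request (the forum URL is a placeholder, as above):

require 'net/http'

url = URI.parse('http://www.forum.site/')
headers = {
  'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1'
}
# Net::HTTP::Get.new accepts an initial-header hash as its second argument.
req = Net::HTTP::Get.new(url.path, headers)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }
puts res.body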

Related

How to avoid getting blocked by websites when using Ruby Mechanize for web crawling

I can successfully scrape building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited according to their robots.txt.
The code for a single run is as follows:
require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @key = key
    @link_key = link_key
    @result_hash = {} # collects scraped name => value pairs
  end

  def crawl_propshark_single
    agent = Mechanize.new { |agent|
      agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = "#{@key}"
    page = agent.submit(form)
    page = form.submit # note: this submits the form a second time, overwriting the previous result

    page.links.each do |link|
      if link.text.include?("#{@link_key}")
        if link.text.include?("PropertyShark")
          property_page = link.click
        else
          next
        end
        if property_page
          data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
          data_name  = property_page.css("div.cols").css("th")[4].text
          @result_hash["#{data_name}"] = data_value
        else
          next
        end
      end
    end

    return @result_hash
  end
end #endof: class PropShark

# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key, key_link)
puts spider.crawl_propshark_single
I get the following error, but in an hour or two it disappears:
undefined method `text' for nil:NilClass (NoMethodError)
When I loop over multiple addresses using the above code, I delay the process with sleep 80 between addresses.
The first thing you should do, before you do anything else, is to contact the website owner(s). Right now, your actions could be interpreted as anything from overly aggressive to illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for this particular thing. Either way, if you are going to be depending on this website for your product, you may want to consider playing nice with them.
With that being said, you are moving through their website with all of the grace of an elephant in a china shop. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural human-emulation delay. Also, you should either disguise your user agent or make it super obvious (Josh's Big Bad Scraper). You may even consider using something like Selenium, which uses a real browser, instead of Mechanize, to give away fewer hints.
You may also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page but some random error page. A simple retry may be all you need to get the data in question. When scraping, a poorly functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
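For example, here is a minimal retry sketch (the helper name, retry count, and backoff are illustrative assumptions, not part of the original code):

def fetch_with_retry(agent, url, attempts = 3)
  begin
    agent.get(url)
  rescue Mechanize::ResponseCodeError => e
    attempts -= 1
    if attempts > 0
      sleep(rand(5..15)) # back off briefly, with some jitter
      retry
    end
    raise e
  end
end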
If none of that works, you could consider setting up elaborate arrays of proxies, but at that point you would be much better off using one of the many online web-scraping / API-creation / data-extraction services that currently exist. They are fairly inexpensive and already do everything discussed above, plus more.
It is very likely nothing is "blocking" you. As you pointed out
property_page.css("div.cols").css("td.r_align")[4].text
is the problem. So let's focus on that line of code for a second.
Say the first time around your columns are columns = [1,2,3,4,5]; then columns[4] will return 5 (the element at index 4).
Now for fun let's assume the next time around your columns are columns = ['a','b','c','d']; then columns[4] will return nil because there is nothing at index 4.
This appears to be your case: sometimes there are 5 columns and sometimes there are not, leading to nil.text and the error you are receiving.
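A minimal defensive sketch, inside the existing links.each loop (the structure is illustrative, not from the original code):

# Guard against a missing fifth column instead of calling .text on nil.
cell = property_page.css("div.cols").css("td.r_align")[4]
if cell
  data_value = cell.text
else
  # Fewer columns than expected; skip this page rather than crash.
  next
end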

EventMachine not catching nearly simultaneous events

I'm using EventMachine to process incoming emails which could at times be very high volume. The code that I have so far definitely works for emails that come in separated by at least about 5 seconds, but somewhere below that, only one email will be processed out of however many arrive. I've tried adding EM.defer statements in a few different places which I thought would help, but to no avail. I should also note, if it makes any difference, that I'm using the em-imap gem in this example as well.
The relevant section of the code is here:
EM.run do
  client = EM::IMAP.new('imap.gmail.com', 993, true)
  client.connect.bind! do
    client.login('me@email.com', 'password123')
  end.bind! do
    client.select('INBOX')
  end.bind! do
    client.wait_for_new_emails do |response|
      client.fetch(response.data).callback do |fetched|
        currentSubjectLine = fetched.first.attr.values[1].subject
        desiredCommand = parseSubjectLine(currentSubjectLine)
        if desiredCommand == 0
          if fetched.first.attr.values[0].parts.length == 2
            if fetched.first.attr.values[0].parts[1].subtype.downcase != "pdf"
              puts 'Error: Missing attachment, or attachment of the wrong type.'
            else
              file_name = fetched.first.attr.values[0].parts[1].param.values[0]
              client.fetch(response.data, "BODY[2]").callback do |attachments|
                attachment = attachments[0].attr["BODY[2]"]
                File.new(file_name, 'wb+').write(Base64.decode64(attachment))
              end
            end...
Am I somehow blocking the reactor in this code segment? Is it possible that some library that I'm using isn't appropriate here? Could GMail's IMAP server have something to do with it? Do you need any more information about what happens in some given situation before you can answer with confidence? As always, any help is greatly appreciated. Thank you!
Update with Minimized Code
Just in case anything in my organization has anything to do with it, I'm including everything that I think might possibly be relevant.
module Processing
  def self.run
    EM.run do
      client = EM::IMAP.new('imap.gmail.com', 993, true)
      client.connect.bind! do
        client.login('me@email.com', 'password123')
      end.bind! do
        client.select('INBOX')
      end.bind! do
        client.wait_for_new_emails do |response|
          client.fetch(response.data).callback do |fetched|
            puts fetched[0].attr.values[1].subject
          end
        end
      end.errback do |error|
        puts "Something failed: #{error}"
      end
    end...
Processing.run
Don't hate me for saying this, but refactor that pyramid-of-doom spaghetti thingy that makes Demeter twitch into something readable, and the error will reveal itself :)
If it doesn't reveal itself, you will be able to boil it down to the simplest possible code that reproduces the problem and submit it as an issue to https://github.com/eventmachine/eventmachine
However, EM isn't really supported any more; the devs went a bit AWOL, so think about moving to https://github.com/celluloid/celluloid and https://github.com/celluloid/celluloid-io
PS: just saw this
File.new(file_name,'wb+').write(Base64.decode64(attachment))
is a blocking call, AFAIK; try playing with it and you might be able to reproduce the issue. See https://github.com/martinkozak/em-files and http://eventmachine.rubyforge.org/EventMachine.html#defer-class_method for possible ways to work around this.
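For instance, a minimal sketch of pushing the write onto EM's thread pool with EM.defer (variable names are taken from the question's snippet; the exact placement is an assumption):

client.fetch(response.data, "BODY[2]").callback do |attachments|
  attachment = attachments[0].attr["BODY[2]"]
  EM.defer do
    # Runs on a worker thread, so the blocking file write
    # no longer stalls the reactor loop.
    File.open(file_name, 'wb') { |f| f.write(Base64.decode64(attachment)) }
  end
end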

Google Translate API call in Ruby generates "URL too Large" error

The following Ruby code accesses Google Translate's API:
require 'google/api_client'

client = Google::APIClient.new
translate = client.discovered_api('translate', 'v2')
client.authorization.access_token = '123' # dummy
client.key = "<My Google API Key>"
response = client.execute(
  :api_method => translate.translations.list,
  :parameters => {
    'format' => 'text',
    'source' => 'eng',
    'target' => 'fre',
    'q' => 'This is the text to be translated'
  }
)
The response comes back in JSON format.
This code works for text-to-be-translated (the q argument) of less than approximately 750 characters, but generates the following error for longer text:
Error 414 (Request-URI Too Large)!!1
I have Googled this problem and found the following pages:
https://developers.google.com/translate/
https://github.com/google/google-api-ruby-client
It seems that the Google API Client code is placing the call to the Google API using a GET instead of a POST. As various systems place limits on the length of URLs, this restricts the length of text that can be translated.
The form of the solution to this problem seems to be that I should instruct the Google API interface to place the call to the Google API using a POST instead of a GET. However, I so far haven't been able to figure out how to do this.
I see that there are a few other Stack Overflow questions relating to this issue in other languages. However, I so far haven't been able to figure out how to apply their solutions to my Ruby code.
Thanks very much Dave Sag. Your approach worked, and I have developed it into a working example that other Stack Overflow users might find useful. (Note: Stack Overflow won't let me post this as an answer for eight hours, so I'm editing my question instead.) Here is the example:
#!/usr/local/rvm/rubies/ruby-1.9.3-p194/bin/ruby
#
# This is a self-contained Unix/Linux Ruby script.
#
# Replace the first line above with the location of your ruby executable. e.g.
#!/usr/bin/ruby
#
################################################################################
#
# Author : Ross Williams.
# Date : 1 January 2014.
# Version : 1.
# Licence : Public Domain. No warranty or liability.
#
# WARNING: This code is example code designed only to help Stack Overflow
# users solve a specific problem. It works, but it does not contain
# comprehensive error checking which would probably double the length
# of the code.
require 'uri'
require 'net/http'
require 'net/https'
require 'rubygems'
require 'json'
################################################################################
#
# This method translates a string of up to 5K from one language to another using
# the Google Translate API.
#
# google_api_key
# This is a string of about 39 characters which identifies you
# as a user of the Google Translate API. At the time of writing,
# this API costs US$20/MB of source text. You can obtain a key
# by registering at:
# https://developers.google.com/translate/
# https://developers.google.com/translate/v2/getting_started
#
# source_language_code
# target_language_code
# These arguments identify the source and target language.
# Each of these arguments should be a string containing one
# of Google's two-letter language identification codes.
# A list of these codes can be found at:
# https://sites.google.com/site/opti365/translate_codes
#
# text_to_be_translated
# This is a string to be translated. Ruby provides excellent
# Unicode support and this string can contain non-ASCII characters.
# This string must not be longer than 5K characters long.
#
def google_translate(google_api_key, source_language_code, target_language_code, text_to_be_translated)

  # Note: The 5K limit on text_to_be_translated is stated at:
  # https://developers.google.com/translate/v2/using_rest?hl=ja
  # It is not clear to me whether this is 5*1024 or 5*1000.

  # The Google Translate API is served at this HTTPS address.
  # See http://stackoverflow.com/questions/13152264/sending-http-post-request-in-ruby-by-nethttp
  uri = URI.parse("https://www.googleapis.com/language/translate/v2")
  request = Net::HTTP::Post.new(uri.path)

  # The API accepts only the GET method. However, the GET method has URL length
  # limitations, so we have to use the POST method instead. We need a trick for
  # delivering the form fields using POST while having the API perceive it as a GET.
  # Setting this key/value pair in request adds a field to the HTTP header to do this.
  request['X-HTTP-Method-Override'] = 'GET'

  # Load the arguments to the API into the POST form fields.
  params = {
    'access_token' => '123', # Dummy parameter seems to be required.
    'key'    => google_api_key,
    'format' => 'text',
    'source' => source_language_code,
    'target' => target_language_code,
    'q'      => text_to_be_translated
  }
  request.set_form_data(params)

  # Execute the request.
  # It's important to use HTTPS, as the API key is being transmitted.
  https = Net::HTTP.new(uri.host, uri.port)
  https.use_ssl = true
  response = https.request(request)

  # The API returns a record in JSON format.
  # See http://en.wikipedia.org/wiki/JSON
  # See http://www.json.org/
  json_text = response.body

  # Parse the JSON record, yielding a nested hash/list data structure.
  json_structure = JSON.parse(json_text)

  # Navigate down into the data structure to get the result.
  # This navigation was coded by examining the JSON text from an actual run.
  data_hash = json_structure['data']
  translations_list = data_hash['translations']
  translation_hash = translations_list[0]
  translated_text = translation_hash['translatedText']

  return translated_text
end # def google_translate
################################################################################
google_api_key = '<INSERT YOUR GOOGLE TRANSLATE API KEY HERE>'
source_language_code = 'en' # English
target_language_code = 'fr' # French
# To test the code, I have chosen a sample text of about 3393 characters.
# This is large enough to exceed the GET URL length limit, but small enough
# not to exceed the Google Translate API length limit of 5K.
# Sample text is from http://www.gutenberg.org/cache/epub/2701/pg2701.txt
text_to_be_translated = <<END
Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
There now is your insular city of the Manhattoes, belted round by
wharves as Indian isles by coral reefs--commerce surrounds it with
her surf. Right and left, the streets take you waterward. Its extreme
downtown is the battery, where that noble mole is washed by waves, and
cooled by breezes, which a few hours previous were out of sight of land.
Look at the crowds of water-gazers there.
Circumambulate the city of a dreamy Sabbath afternoon. Go from Corlears
Hook to Coenties Slip, and from thence, by Whitehall, northward. What
do you see?--Posted like silent sentinels all around the town, stand
thousands upon thousands of mortal men fixed in ocean reveries. Some
leaning against the spiles; some seated upon the pier-heads; some
looking over the bulwarks of ships from China; some high aloft in the
rigging, as if striving to get a still better seaward peep. But these
are all landsmen; of week days pent up in lath and plaster--tied to
counters, nailed to benches, clinched to desks. How then is this? Are
the green fields gone? What do they here?
But look! here come more crowds, pacing straight for the water, and
seemingly bound for a dive. Strange! Nothing will content them but the
extremest limit of the land; loitering under the shady lee of yonder
warehouses will not suffice. No. They must get just as nigh the water
as they possibly can without falling in. And there they stand--miles of
them--leagues. Inlanders all, they come from lanes and alleys, streets
and avenues--north, east, south, and west. Yet here they all unite.
Tell me, does the magnetic virtue of the needles of the compasses of all
those ships attract them thither?
Once more. Say you are in the country; in some high land of lakes. Take
almost any path you please, and ten to one it carries you down in a
dale, and leaves you there by a pool in the stream. There is magic
in it. Let the most absent-minded of men be plunged in his deepest
reveries--stand that man on his legs, set his feet a-going, and he will
infallibly lead you to water, if water there be in all that region.
Should you ever be athirst in the great American desert, try this
experiment, if your caravan happen to be supplied with a metaphysical
professor. Yes, as every one knows, meditation and water are wedded for
ever.
END
translated_text =
google_translate(google_api_key,
source_language_code,
target_language_code,
text_to_be_translated)
puts(translated_text)
################################################################################
I'd use the REST API directly rather than the whole Google API client, which seems to encompass everything; when I searched its source for 'translate', it turned up nothing.
Create a request and set the appropriate header, as per these docs:
Note: You can also use POST to invoke the API if you want to send more data in a single request. The q parameter in the POST body must be less than 5K characters. To use POST, you must use the X-HTTP-Method-Override header to tell the Translate API to treat the request as a GET (use X-HTTP-Method-Override: GET).
Code (not tested) might look like:

require 'net/http'
require 'net/https'
require 'json'

uri = URI.parse('https://www.googleapis.com/language/translate/v2')
request = Net::HTTP::Post.new(uri.path, {'X-HTTP-Method-Override' => 'GET'})
params = {
  'format' => 'text',
  'source' => 'eng',
  'target' => 'fre',
  'q' => 'This is the text to be translated'
}
request.set_form_data(params)

https = Net::HTTP.new(uri.host, uri.port)
https.use_ssl = true
response = https.request(request)

# Parse the JSON body of the response.
parsed = JSON.parse(response.body)

Determining the parameters for "discovered_api"

This might be a really stupid question, but how do you determine the parameters to pass in the client's discovered_api call? And then, what is the api_method that gets executed?
For instance, I'm trying to call the Admin SDK Users List method, which is GET https://www.googleapis.com/admin/directory/v1/users.
There doesn't seem to be a clear way of extracting this from the API reference, or am I just looking in the wrong place?
I misunderstood the original question. I still think the other post is potentially valuable, so I thought I'd add a new answer.
I experimented a bit, and this will display the title, ID, and whether or not it is preferred. The ID contains a colon, which seems to mark where you split it into the first and second arguments when calling discovered_api.
puts "Title \t ID \t Preferred"
client.discovered_apis.each do |gapi|
  puts "#{gapi.title} \t #{gapi.id} \t #{gapi.preferred}"
end
I had this exact question, and for methods like get I figured it out.
Create your client and then do the following:
api = client.discovered_api("admin", "directory_v1")
puts "--- Users List ---"
puts api.users.list.parameters
puts "--- Users Get ---"
puts api.users.get.parameters
This will print off the parameters. You can also use api.users.get.parameter_descriptions
Something that could be helpful if you are trying to probe into issues like this is to print off all the available methods. I typically do it like this.
puts api.users.insert.methods - Object.methods
If you try that one you will see that api.users.insert has the following methods after you take away the ones that are common to every object.
discovery_document
api
method_base
method_base=
description
id
http_method
uri_template
media_upload
request_schema
response_schema
normalize_parameters
generate_uri
generate_request
parameter_descriptions
parameters
required_parameters
optional_parameters
validate_parameters
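For example, a couple of these are handy for seeing exactly which HTTP call a method will make (the printed values shown in comments are assumptions on my part):

method = api.users.insert
puts method.http_method   # e.g. "POST"
puts method.uri_template  # the URI template the request will expand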
I hope that helps.
James

Nokogiri: How can I extract text from HTML with correct spacing?

I'm trying to extract the text from a document to index it for search. The code below mostly works, except that various words and punctuation run together. When it removes tags, I need it to replace them with spaces so I don't get this issue. I've been trying to figure out the most efficient way to do this, but I'm coming up empty so far.
doc = Nokogiri::HTML(html)
doc.xpath("//script").remove
doc.xpath("//style").remove
doc.xpath("//a").remove
text = doc.text.gsub(/\s+/, ' ')
Here is some sample text I extracted from http://www.washingtontimes.com/blog/redskins-watch/2012/oct/18/redskins-linemen-respond-jason-pierre-paul-rg3-com/
Before the season it was New York Giants defensive end Osi Umenyiora
who made waves by saying he wouldn't call Robert Griffin III by “RG3”
until he did something. Until then, it was “Bob Griffin.”After
Griffin's 76-yard touchdown run in the Washington Redskins' victory
over the Minnesota Vikings, fellow Giants defensive end Jason
Pierre-Paul was the one who had some comments about Griffin.“Don’t
bring it to my side," Pierre-Paul told New York media. “Go the other
way. …“Yes, it'll be a very good matchup. Not on my side, though. Not
on my side. Or the other side.”Griffin, asked jokingly Wednesday about
running for office, said: “I’ve got a lot other guys to be running
away from right now, Pierre-Paul, Osi, all those guys.”But according
to a couple of Redskins linemen, Griffin shouldn't have much to worry
about Sunday if he gets into the open field.“If Robert gets into that
situation, I don't think there's many people that can run him down,”
right guard Chris Chester said. “I'm still going to go out there and
try to block and make sure no one touches Robert at all. But he's a
plenty good athlete to be able to outrun a lot of people in this
league.”Prompted with Pierre-Paul's comments, left tackle Trent
Williams responded: “What do you want me to say about that?”“Robert's
my guy. I don't know Pierre-Paul. I don't know why he would say
something like that,” he said. “Maybe he knows something I don't.”
You could try inserting a space before each p tag:
doc.search('p').each{|el| el.before ' '}
but a better approach is probably something like:
text = doc.search('div.story p').map{|p| p.text}.join(" ")
Other answers discuss inserting whitespace into the document, but if (as the question asks) your requirement is to replace those nodes with whitespace, Nokogiri has a replace method. So, to replace script tags with spaces, do:
doc.xpath('//script').each do |node|
  node.replace(' ')
end
The question also asks about 'correct' spacing. Most browsers will not insert a space when rendering text around a <script> tag, so while this is useful for text extraction, it is not necessarily the 'correct' thing to do.
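Putting the pieces together, a minimal sketch of the whole extraction (the list of tags to pad with spaces is illustrative; adjust it to your documents):

require 'nokogiri'

doc = Nokogiri::HTML(html)

# Replace nodes whose text should not be indexed at all.
doc.xpath('//script').each { |node| node.replace(' ') }
doc.xpath('//style').each  { |node| node.replace(' ') }

# Pad block-level tags with a space so adjacent words
# don't run together once the markup is stripped.
doc.search('p, div, br, li').each { |node| node.before(' ') }

text = doc.text.gsub(/\s+/, ' ').strip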
