Ruby Web API scraping / error handling with Hpricot - ruby

I have written a simple Ruby gem that scrapes a set of websites and exposes a simple API. Inside the gem I have included a retry method that attempts the Hpricot call three or more times on failure, mostly because of timeouts.
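def retryable(options = {}, &block)
  opts = { :tries => 1, :on => Exception }.merge(options)
  retry_exception, retries = opts[:on], opts[:tries]
  begin
    return yield
  rescue retry_exception
    # re-run the begin block while attempts remain
    retry if (retries -= 1) > 0
  end
  # attempts used up: run the block one last time and let any error propagate
  yield
end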
Now, in my Rails app, which uses this gem, I'm wondering how to handle errors when the gem itself fails to produce a result, for whatever reason.
models/Available.rb
data = Whatever.find_item_by_id options
unless data
  raise "Web error"
end
I'm not quite sure how to handle this. At this point I don't really care about retrying; I only want to return a result: either the hash that the gem returns, or false with some error?

For normal errors such as 404 or 500, whatever mechanism you use to fetch the website content will raise the error. For other failures, the gem should raise its own errors and so inform your Rails app, where they can be handled, because the handling is specific to that app.
The gem should be as generic as possible, for reuse etc.
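As a rough sketch of that split, the gem could wrap low-level failures in its own error class and leave the decision about what a failure means to the app. Everything named here beyond Whatever.find_item_by_id is hypothetical: the Whatever::Error and Whatever::FetchFailed classes, the fetcher argument (a stand-in for the real Hpricot/HTTP work) and the item_or_false helper are assumptions made for illustration.

```ruby
require 'timeout'

module Whatever
  class Error < StandardError; end   # hypothetical base error for the gem
  class FetchFailed < Error; end     # hypothetical: the scrape could not complete

  # Stand-in for the gem's public call. `fetcher` is any callable that does
  # the real scraping work and returns a Hash (an assumption for this sketch).
  def self.find_item_by_id(id, fetcher)
    fetcher.call(id)
  rescue Timeout::Error, IOError, SystemCallError => e
    # translate low-level failures into one gem-specific error
    raise FetchFailed, "item #{id}: #{e.class}: #{e.message}"
  end
end

# App side (for example inside the Rails model): return the hash, or false.
def item_or_false(id, fetcher)
  Whatever.find_item_by_id(id, fetcher)
rescue Whatever::Error => e
  warn "scrape failed: #{e.message}"
  false
end
```

With this shape the app only ever rescues Whatever::Error, so the gem stays generic and the "hash or false" policy lives in the app.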

Related

Setting an HTTP Timeout in Ruby 1.9.3

I'm using Ruby 1.9.3 and need to GET a URL. I have this working with Net::HTTP, however, if the site is down, Net::HTTP ends up hanging.
While searching the internet, I've seen that many people have faced similar problems, all with hacky solutions. However, many of those posts are quite old.
Requirements:
I'd prefer using Net::HTTP to installing a new gem.
I need both the Body and the Response Code. (e.g. 200)
I do not want to require open-uri, since that makes global changes and raises some security issues.
I need to GET a URL within X seconds, or return error.
Using Ruby 1.9.3, how can I GET a URL while setting a timeout?
To clarify, my existing code looks like:
Net::HTTP.get_response(URI.parse(url))
Trying to add:
Net::HTTP.open_timeout(1000)
Results in:
NoMethodError: undefined method `open_timeout' for Net::HTTP:Class
You can set the open_timeout attribute on a Net::HTTP instance (it is not a class method) before making the connection. The value is in seconds.
require 'net/http'
uri = URI.parse(url)
http = Net::HTTP.new(uri.hostname, uri.port)
http.open_timeout = 1000 # seconds to wait for the connection to open
response = http.request_get(uri.request_uri)
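To return both the response code and the body, or an error when the server is too slow, the same idea can be wrapped in a small method. This is a minimal sketch, not a definitive implementation: fetch_with_timeout is a hypothetical name, and the choice to return a nil code plus the error message on timeout is an assumption. Note that read_timeout applies to each read, so the two settings together do not put a strict cap on total elapsed time.

```ruby
require 'net/http'
require 'uri'
require 'timeout'

# Hypothetical helper: returns [code, body], or [nil, message] on a timeout.
def fetch_with_timeout(url, seconds)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = seconds # limit for opening the connection
  http.read_timeout = seconds # limit for each read from the socket
  response = http.request_get(uri.request_uri)
  [response.code, response.body]
rescue Timeout::Error => e
  # timeouts raised by Net::HTTP are rescued here as Timeout::Error
  [nil, "timed out: #{e.message}"]
end

# Usage: code, body = fetch_with_timeout("http://example.org/", 5)
```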
I tried all the solutions here and on the other questions about this problem, but I only got everything right with the following code. open-uri, which ships with Ruby's standard library, is a wrapper around Net::HTTP.
I needed a GET that had to wait longer than the default timeout and then read the response. The code is also simpler.
require 'open-uri'
open(url, :read_timeout => 5 * 60) do |response|
  if response.read[/Return: Ok/i]
    log "sending ok"
  else
    raise "error sending, no confirmation received"
  end
end

Error handling using aspects in Ruby

I see myself handling similar exceptions in a rather similar fashion repeatedly and would like to use aspects to keep this error handling code outside of the core business logic. A quick search online pulled up a couple of ruby gems (aquarium, aspector, etc) but I don't see a whole lot of downloads for those gems in rubygems. Given that, I want to believe there are probably other nicer ways to deal with this in Ruby.
get '/products/:id' do
  begin
    product = find_product params[:id]
  rescue Mongoid::Errors::DocumentNotFound
    status 404
  end
end
get '/users/:id' do
  begin
    user = find_user params[:id]
  rescue Mongoid::Errors::DocumentNotFound
    status 404
  end
end
In the above example, there are two Sinatra routes that look up a requested object by ID in MongoDB and return a 404 if the object is not found. Clearly, the code is repetitive, and I am looking for a Ruby way to make it DRY.
You can see the answer in this guide.
Your code example becomes:
error Mongoid::Errors::DocumentNotFound do
  status 404
end

Sinatra matches params[:id] as string type, additional conversion needed to match the database id?

I am using Sinatra and DataMapper to access an sqlite3 database. I always get nil when calling get(params[:id]), but when I call get(params[:id].to_i) I get the right record. Is something wrong, such that I have to do the conversion explicitly?
The Sinatra app is simple:
class Record
  include DataMapper::Resource
  property :id, Serial
  ....
end
get '/list/:id' do
  r = Record.get(params[:id])
  ...
end
Obviously this is a problem with DataMapper (if you believe it should be casting strings to numbers for ids), but there are ways Sinatra can mitigate it. When params come in you need to check:
They exist.
They're the right type (or castable).
They're within the range of values required or expected.
For example:
get '/list/:id' do
  r = Record.get(params[:id].to_i)
  # more code…
end
curl http://example.org/list/ddd
That won't work well ("ddd".to_i is 0); better to check and return an error message:
get '/list/:id' do |id| # the block syntax is helpful here
  halt 400, "Supply an I.D. *number*" unless id =~ /\A\d+\z/
end
Then consider whether you want a default value, whether the value is in the right range, and so on. When taking in IDs I tend to use the regex syntax for routes, as it stops the following sub-routes from being gobbled up too, while providing a bit of easy type checking:
get %r{/list/(\d+)} do |id|
Helpers are also useful in this situation:
helpers do
  # It's not required to take an argument, since
  # the params helper is accessible inside other helpers,
  # but it's easier to test, and (perhaps) philosophically better.
  def id( ps )
    if ps[:id]
      ps[:id].to_i
    else
      # raise an error, halt, or redirect, or whatever!
    end
  end
end
get '/list/:id' do
  r = Record.get id(params)
end
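The three checks listed above (exists, right type, within range) can be folded into one plain-Ruby function that is easy to test outside Sinatra. This is only a sketch: parse_id is a hypothetical name, and treating ids as whole numbers starting at 1 is an assumption about how the Serial property is used.

```ruby
# Hypothetical helper: returns the id as an Integer, or nil when the input
# is missing, is not made up solely of digits, or is not positive.
def parse_id(raw)
  return nil unless raw.is_a?(String) && raw =~ /\A\d+\z/
  id = raw.to_i
  id > 0 ? id : nil
end

# Possible use inside a route (Sinatra's halt stops the request):
#   id = parse_id(params[:id]) or halt 400, "Supply an I.D. *number*"
#   r = Record.get(id)
```

The \A and \z anchors matter here: an unanchored /\d+/ would also accept a value such as "12abc".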
To clarify, the comment on the original question by @mbj is correct. This is a bug in dm-core with Ruby 2.0; it worked fine with Ruby 1.9. You are likely on dm-core version 1.2 and need 1.2.1, which you can get by running 'gem update dm-core'.

Ruby and Timeout.timeout performance issue

I'm not sure how to solve this big performance issue in my application. I'm using open-uri to request the most popular videos from YouTube, and when I ran perftools (https://github.com/tmm1/perftools.rb) it showed that the biggest performance issue is Timeout.timeout. Can anyone suggest how to solve the problem?
I'm using ruby 1.8.7.
Edit:
This is the output from my profiler
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4bANr--YcONZDRlMmFhZjQtYzIyOS00YjZjLWFlMGUtMTQyNzU5ZmYzZTU4&hl=en_US
Timeout is wrapping the function that is actually doing the work to ensure that if the server fails to respond within a certain time, the code will raise an error and stop execution.
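A small sketch of that wrapping behaviour, using only the standard library. The run_with_limit name and the sleep calls (standing in for a slow server) are assumptions for illustration. Because the real work runs inside the Timeout.timeout block, a profiler will plausibly charge the waiting time to Timeout.timeout itself.

```ruby
require 'timeout'

# Hypothetical helper: runs the block under a time limit and reports the outcome.
def run_with_limit(seconds)
  [:ok, Timeout.timeout(seconds) { yield }]
rescue Timeout::Error
  [:timed_out, nil]
end

fast = run_with_limit(1)   { :response }           # finishes well inside the limit
slow = run_with_limit(0.1) { sleep 1; :response }  # simulated slow server
```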
I suspect that what you are seeing is that the server is taking some time to respond. You should look at caching the response in some way.
For instance, using memcached (pseudocode)
require 'date'
require 'dalli'
require 'open-uri'

DALLI = Dalli::Client.new

class PopularVideos
  def self.get
    key = "videos_#{Date.today}"
    result = DALLI.get(key)
    unless result
      # cache miss: fetch and parse once, then store under today's key
      doc = open("http://youtube/url")
      result = parse_videos(doc) # parse the doc somehow
      DALLI.set(key, result)
    end
    result
  end
end

PopularVideos.get # calls your expensive parsing script once
PopularVideos.get # gets the result from memcached for the rest of the day
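The same day-keyed idea can be tried without a memcached server. The sketch below is an in-process stand-in, not a replacement for memcached: DailyCache is a hypothetical class, the cache lives only as long as the Ruby process, and old days' entries are never evicted.

```ruby
require 'date'

# Hypothetical in-process cache: one stored value per name per day.
class DailyCache
  def initialize
    @store = {}
  end

  # Returns the value cached under today's key; runs the block only on a miss.
  def fetch(name, today = Date.today)
    key = "#{name}_#{today}"
    @store.fetch(key) { @store[key] = yield }
  end
end

calls = 0
cache = DailyCache.new
day = Date.new(2024, 1, 1)
first    = cache.fetch("videos", day)     { calls += 1; ["a", "b"] } # miss: block runs
second   = cache.fetch("videos", day)     { calls += 1; ["a", "b"] } # hit: block skipped
next_day = cache.fetch("videos", day + 1) { calls += 1; ["c"] }      # new key: block runs
```

The expensive block runs once per day key, which is the property the memcached version relies on.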

Is it possible to access Ruby EventMachine Channels from Thin/Rack/Sinatra?

I'm looking to build a simple, RESTful notification system for an internal project leveraging Sinatra. I've used EventMachine channels in the past to subscribe/publish to events, but in all my previous cases I was using EventMachine directly.
Does anyone know if it's possible to create, subscribe, and publish to EventMachine channels (running in Thin) from a Sinatra application, or even from some Rack middleware for that matter?
Have a look at async_sinatra.
Basically, to make it possible to use EventMachine when running in Thin you need to make it aware that you want to serve requests asynchronously. The Rack protocol is synchronous by design, and Thin expects a request to be done when the handler returns. There are ways to make Thin aware that you want to handle the request asynchronously (see thin_async for an example of how), and async_sinatra makes it very easy.
Bryan,
You can use the em-http-request library (https://github.com/igrigorik/em-http-request); this will allow you to reference a specific EventMachine application running on the same server, on a different server, or wherever you want, really.
require 'eventmachine'
require 'em-http-request'
require 'sinatra/base'
require 'thin'

class ServerClass < EventMachine::Connection
  def initialize(*args)
    # global registry of open connections, shared with the Sinatra app
    $channels ||= []
  end

  def post_init
    $channels << self
    puts "initialized"
    $cb.call("initialized")
  end

  def receive_data(data)
    # got information from client connection
  end

  def channel_send(msg, channel)
    $channels[channel].send_data(msg)
  end

  def channels_send(msg)
    $channels.each { |channel| channel.send_data(msg) }
  end

  def unbind
    # user left: drop the connection from the registry
    $channels.delete(self)
  end
end

EventMachine.run do
  $cb = EM.Callback { |msg| puts msg } # do something creative
  # 8081 is an arbitrary example port
  $ems = EventMachine.start_server('0.0.0.0', 8081, ServerClass)

  class App < Sinatra::Base
    set :public_folder, File.dirname(__FILE__) + '/public'
    get '/' do
      erb :index
    end
  end

  App.run!({:port => 3000})
end
Above is a basic wireframe. Depending on how you want to go about sending data, you can use WebSockets (em-websocket) and bind each user on login (you would have to add a login system), or you can use this for whatever you like. As long as you have a global reference to the EventMachine object (connection, websocket, channel), you can pass messages from within your application.
BTW, the EventMachine.run do ... end loop is optional, since Thin will start it anyway. It helps to know how it works, though.
Good Luck

Resources