Adjusting timeouts for Nokogiri connections - ruby

Why does Nokogiri wait a couple of seconds (3-5) when the server is busy and I'm requesting pages one by one, but when these requests are in a loop, Nokogiri does not wait and throws the timeout message?
I'm wrapping the request in a timeout block, but Nokogiri does not wait for that time at all.
Any suggested procedure on this?
# this is a method from the eng class
def get_page(url, page_type)
  begin
    Timeout.timeout(10) do
      # Get a Nokogiri::HTML::Document for the page we're interested in...
      @doc = Nokogiri::HTML(open(url))
    end
  rescue Timeout::Error
    puts "Time out connection request"
    raise
  end
end
# this is a snippet from the main app calling the eng class
# receives a hash with urls and goes through them, asking for each one by one
def retrieve_in_loop(links)
  (0...links.length).each do |idx|
    url = links[idx]
    puts "Visiting link #{idx} of #{links.length}"
    puts "link: #{url}"
    begin
      @eng.get_page(url, product)
    rescue Exception => e
      puts "Error getting url: #{idx} #{url}"
      puts "This link will be skipped. Continuing with the next one."
    end
  end
end

The timeout block simply sets the maximum time the code inside the block is allowed to run before an exception is triggered. It does not affect anything inside Nokogiri or OpenURI.
You can set the timeout to a year, but OpenURI can still time out whenever it likes.
So your problem is most likely that OpenURI is timing out on the connection attempt itself. Nokogiri has no timeouts; it's just a parser.
Adjusting read timeout
The only timeout you can adjust through OpenURI is the read timeout. It seems you cannot change the connection timeout this way:
open(url, :read_timeout => 10)
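For example, here is a sketch of the question's get_page rewritten to lean on OpenURI's own read timeout instead of Timeout.timeout (on Ruby 2.x the failure surfaces as Net::ReadTimeout, which subclasses Timeout::Error):
require 'open-uri'
require 'nokogiri'

# A sketch: let OpenURI enforce the read timeout itself
def get_page(url)
  Nokogiri::HTML(open(url, :read_timeout => 10))
rescue Timeout::Error
  puts "Read timed out for #{url}"
  raise
end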
Adjusting connection timeout
To adjust the connection timeout you would have to go with Net::HTTP directly instead:
require 'net/http'
require 'nokogiri'

uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.open_timeout = 10
http.read_timeout = 10
response = http.get(uri.path)
Nokogiri.parse(response.body)
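One small caveat with this approach: uri.path is an empty string for a URL like http://example.com and it also drops any query string, so uri.request_uri (which falls back to "/" and keeps the query) is usually the safer argument to http.get.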
You can also take a look at some additional discussion here:
Ruby Net::HTTP time out
Increase timeout for Net::HTTP

Related

web server in ruby and connection keep-alive

Web server example:
require 'rubygems'
require 'socket'
require 'thread'

class WebServer
  LINE_TERMINATOR = "\r\n".freeze

  def initialize(host, port)
    @server = TCPServer.new(host, port)
  end

  def run
    response_body = 'Hello World!'.freeze
    response_headers = "HTTP/1.1 200 OK#{LINE_TERMINATOR}Connection: Keep-Alive#{LINE_TERMINATOR}Content-Length: #{response_body.bytesize}#{LINE_TERMINATOR}".freeze
    loop do
      Thread.new(@server.accept) do |socket|
        puts "request #{socket}"
        sleep 3
        socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
        socket.write(response_headers)
        socket.write(LINE_TERMINATOR)
        socket.write(response_body)
        # socket.close # if this line is uncommented then it works
      end
    end
  end
end

WebServer.new('localhost', 8888).run
If I refresh the browser without waiting for the cycle to end, the following requests are not processed.
How can I handle incoming requests on persistent sockets?
You need to:

1. Keep around the sockets you get from the @server.accept call. Store them in an array (socket_array).

2. Use the IO.select call on the array of sockets to get the set of sockets that can be read:

ready = IO.select(socket_array)
readable = ready[0]
readable.each do |socket|
  # Read from the socket here
  # Do the rest of the processing here
end

3. Don't close the socket after you have sent the data.

If you need more details leave a comment - I can write more of the code.
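Here is a minimal single-threaded sketch of that approach, reworking the server above around IO.select (my own illustration, with only bare-bones request parsing and no timeouts):
require 'socket'

LINE_TERMINATOR = "\r\n".freeze

server = TCPServer.new('localhost', 8888)
sockets = [server] # the listening socket plus every accepted client socket

loop do
  readable, = IO.select(sockets)
  readable.each do |socket|
    if socket == server
      # New connection: accept it and keep the socket around
      sockets << server.accept
    else
      request_line = socket.gets
      if request_line.nil?
        # Client closed the connection; forget the socket
        sockets.delete(socket)
        socket.close
      else
        # Consume the request headers up to the blank line
        nil while (line = socket.gets) && line != LINE_TERMINATOR
        body = 'Hello World!'
        socket.write("HTTP/1.1 200 OK#{LINE_TERMINATOR}" \
                     "Connection: Keep-Alive#{LINE_TERMINATOR}" \
                     "Content-Length: #{body.bytesize}#{LINE_TERMINATOR}" \
                     "#{LINE_TERMINATOR}#{body}")
        # Note: no socket.close here, so the connection stays usable
      end
    end
  end
end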

How to download a binary file via Net::HTTP::Get?

I am trying to download a binary file via HTTP using the following Ruby script.
#!/usr/bin/env ruby
require 'net/http'
require 'uri'

def http_download(resource, filename, debug = false)
  uri = URI.parse(resource)
  puts "Starting HTTP download for: #{uri}"
  http_object = Net::HTTP.new(uri.host, uri.port)
  http_object.use_ssl = true if uri.scheme == 'https'
  begin
    http_object.start do |http|
      request = Net::HTTP::Get.new uri.request_uri
      Net::HTTP.get_print(uri) if debug
      http.read_timeout = 500
      http.request request do |response|
        open filename, 'w' do |io|
          response.read_body do |chunk|
            io.write chunk
          end
        end
      end
    end
  rescue Exception => e
    puts "=> Exception: '#{e}'. Skipping download."
    return
  end
  puts "Stored download as #{filename}."
end
However, it downloads the HTML source instead of the binary. When I enter the URL in the browser, the binary file is downloaded. Here is a URL with which the script fails:
http://dcatlas.dcgis.dc.gov/catalog/download.asp?downloadID=2175&downloadTYPE=KML
I execute the script as follows:
pry> require 'myscript'
pry> resource = "http://dcatlas.dcgis.dc.gov/catalog/download.asp?downloadID=2175&downloadTYPE=KML"
pry> http_download(resource,"StreetTreePt.KML", true)
How can I download the binary?
Redirection experiments
I found this redirection check, which looks quite reasonable. When I integrate it into the response block, it fails with the following error:
Exception: 'undefined method `host' for "save_download.asp?filename=StreetTreePt.KML":String'. Skipping download.
The exception does not occur in the "original" function posted above.
The documentation for Net::HTTP shows how to handle redirects:
Following Redirection
Each Net::HTTPResponse object belongs to a class for its response code.
For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.
Using a case statement you can handle various types of responses properly:
def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))

  case response
  when Net::HTTPSuccess then
    response
  when Net::HTTPRedirection then
    location = response['location']
    warn "redirected to #{location}"
    fetch(location, limit - 1)
  else
    response.value
  end
end

print fetch('http://www.ruby-lang.org')
Or, you can use Ruby's OpenURI, which handles it automatically. Or, the Curb gem will do it. Probably Typhoeus and HTTPClient too.
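For instance, OpenURI follows the whole redirect chain for you (though older versions refuse a redirect that crosses from HTTP to HTTPS):
require 'open-uri'

# base_uri reports where the redirect chain actually ended up
open('http://www.ruby-lang.org') { |f| puts f.base_uri }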
According to the code you show in your question, the exception you are getting can only come from:
http_object = Net::HTTP.new(uri.host, uri.port)
which is hardly likely since uri is a URI object. You need to show the complete code if you want help with that problem.
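Putting the pieces together, here is a sketch of a downloader that follows redirects and streams the body to disk. The helper name is my own; note the 'wb' file mode, which keeps Windows from mangling binary data, and URI.join, which resolves the relative Location header that caused the undefined method 'host' error above:
require 'net/http'
require 'uri'

def download_following_redirects(url, filename, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(uri.request_uri)) do |response|
      case response
      when Net::HTTPRedirection
        # The Location header may be relative; resolve it against the current URL
        location = URI.join(url, response['location']).to_s
        return download_following_redirects(location, filename, limit - 1)
      when Net::HTTPSuccess
        File.open(filename, 'wb') do |io| # 'wb', not 'w', for binary data
          response.read_body { |chunk| io.write(chunk) }
        end
      else
        response.value # raises an appropriate Net::HTTP error
      end
    end
  end
end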

Getting an error when looping

I'm writing some code that pulls URLs from a text file and then checks whether they load or not. The code I have is:
require 'rubygems'
require 'watir'
require 'timeout'
Watir::Browser.default = "firefox"
browser = Watir::Browser.new
File.open('pl.txt').each_line do |urls|
  begin
    Timeout::timeout(10) do
      browser.goto(urls.chomp)
      if browser.text.include? "server"
        puts 'here the page didnt'
      else
        puts 'here site was found'
        File.open('works.txt', 'a') { |f| f.puts urls }
      end
    end
  rescue Timeout::Error => e
    puts e
  end
end
browser.close
The thing is, though, I get this error:
execution expired
/Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/jssh_socket.rb:19:in `const_get': wrong number of arguments (2 for 1) (ArgumentError)
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/jssh_socket.rb:19:in `js_eval'
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/firefox.rb:303:in `open_window'
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/firefox.rb:94:in `get_window_number'
from /Library/Ruby/Gems/1.8/gems/firewatir-1.9.4/lib/firewatir/firefox.rb:103:in `goto'
from samplecodestack.rb:17
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/timeout.rb:62:in `timeout'
from samplecodestack.rb:16
from samplecodestack.rb:13:in `each_line'
from samplecodestack.rb:13
Anyone know how to get it working?
You can use net/http and handle the timeouts too.
require "net/http"
require "uri"
File.open('pl.txt').each_line do |urls|
uri = URI.parse(urls.chomp)
begin
response = Net::HTTP.get_response(uri)
rescue Exception=> e
puts e.message
puts "did not load!"
end
end
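The snippet above doesn't actually configure any timeouts yet; here is a sketch that does (reusing pl.txt and works.txt from the question):
require 'net/http'
require 'uri'

File.open('pl.txt').each_line do |line|
  uri = URI.parse(line.chomp)
  http = Net::HTTP.new(uri.host, uri.port)
  http.open_timeout = 10 # seconds allowed for the connection attempt
  http.read_timeout = 10 # seconds allowed for each read
  begin
    response = http.request_get(uri.request_uri)
    File.open('works.txt', 'a') { |f| f.puts line } if response.is_a?(Net::HTTPSuccess)
  rescue Timeout::Error, SystemCallError => e
    puts "#{line.chomp} did not load: #{e.message}"
  end
end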
I had trouble following your stack trace but it seems to be on your goto statement.
execution expired is the error raised when the block passed to Timeout::timeout takes longer than the allowed time. Note that the timeout checks that its entire block completes within the specified time. Given the line numbers in the stack trace, I am guessing that loading the URL took close to 10 seconds, and the text check then pushed the block over the limit.
I assume you really only mean for the timeout to occur if the page takes longer than 10 seconds to load, rather than the entire test taking 10 seconds to finish. So you should move the if statement out of the Timeout block:
File.open('pl.txt').each_line do |urls|
  begin
    Timeout::timeout(10) do
      browser.goto(urls.chomp)
    end
    if browser.text.include? "server"
      puts 'here the page didnt'
    else
      puts 'here site was found'
      File.open('works.txt', 'a') { |f| f.puts urls }
    end
  rescue Timeout::Error => e
    puts 'here the page took too long to load'
    puts e
  end
end

How to tell a connect timeout error from a read timeout error in Ruby's Net::HTTP

My question is related to How to rescue timeout issues (Ruby, Rails).
Here's the common way to rescue from a timeout:
def action
  # Post using Net::HTTP
rescue Timeout::Error => e
  # Do something
end
I'd like to determine if the exception was raised while trying to connect to the host, or if it was raised while trying to read from the host. Is this possible?
Here's the solution (after Ben's fix):
require "net/http"
http = Net::HTTP.new("example.com")
http.open_timeout = 2
http.read_timeout = 3
begin
http.start
begin
http.request_get("/whatever?") do |res|
res.read_body
end
rescue Timeout::Error
puts "Timeout due to reading"
end
rescue Timeout::Error
puts "Timeout due to connecting"
end
Marc-André Lafortune's solution is still the best if you can't upgrade to ruby 2.x.
Starting from 2.x, a subclass of Timeout::Error will be raised depending on which timeout was triggered:
Net::OpenTimeout
Net::ReadTimeout
However, the read_timeout behavior is strange on 2.x, because the value you set is effectively doubled: Net::HTTP retries an idempotent request once after a read timeout, so you end up waiting through two full timeouts. This article explains why.
Here's a test for both timeouts (tested on 1.8.7, 1.9.3, 2.1.2, 2.2.4).
EDIT: The open_timeout test works on Mac, but on Linux, the client gets a "connection refused" error.
require "net/http"
require "socket"
SERVER_HOST = '127.0.0.1'
SERVER_PORT = 9999
def main
puts 'with_nonlistening_server'
with_nonlistening_server do
make_request
end
puts
puts 'with_listening_server'
with_listening_server do
make_request
end
end
def with_listening_server
# This automatically starts listening
serv = TCPServer.new(SERVER_HOST, SERVER_PORT)
begin
yield
ensure
serv.close
end
end
def with_nonlistening_server
raw_serv = Socket.new Socket::AF_INET, Socket::SOCK_STREAM, 0
addr = Socket.pack_sockaddr_in SERVER_PORT, SERVER_HOST
# Bind, but don't listen
raw_serv.bind addr
begin
yield
ensure
raw_serv.close
end
end
def make_request
http = Net::HTTP.new(SERVER_HOST, SERVER_PORT)
http.open_timeout = 1
http.read_timeout = 1 # seems to be doubled on ruby 2.x
start_tm = Time.now
begin
http.start
begin
http.get('/')
rescue Timeout::Error => err
puts "Read timeout: #{err.inspect}"
end
rescue Timeout::Error => err
puts "Open timeout: #{err.inspect}"
end
end_tm = Time.now
puts "Duration (sec): #{end_tm - start_tm}"
end
if __FILE__ == $PROGRAM_NAME
main
end
Example output on 1.9.3:
with_nonlistening_server
Open timeout: #<Timeout::Error: execution expired>
Duration (sec): 1.002477
with_listening_server
Read timeout: #<Timeout::Error: Timeout::Error>
Duration (sec): 1.00599
Example output on 2.1.2:
with_nonlistening_server
Open timeout: #<Net::OpenTimeout: execution expired>
Duration (sec): 1.005923
with_listening_server
Read timeout: #<Net::ReadTimeout: Net::ReadTimeout>
Duration (sec): 2.009582
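Since the 2.x exception classes are distinct, you can also rescue them directly instead of nesting begin blocks (a sketch reusing the http object from make_request above; 2.x only):
begin
  http.start
  http.get('/')
rescue Net::OpenTimeout => err
  puts "Open timeout: #{err.inspect}"
rescue Net::ReadTimeout => err
  puts "Read timeout: #{err.inspect}"
end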

How to download via HTTP only piece of big file with ruby

I only need to download the first few kilobytes of a file via HTTP.
I tried
require 'open-uri'

url = 'http://example.com/big-file.dat'
limit = 10 * 1024 # say, the first 10 KB
file = open(url)
content = file.read(limit)
But it actually downloads the full file.
This seems to work when using sockets:
require 'socket'

host = "download.thinkbroadband.com"
path = "/1GB.zip" # get 1gb sample file
request = "GET #{path} HTTP/1.0\r\n\r\n"
socket = TCPSocket.open(host, 80)
socket.print(request)

# find beginning of response body
buffer = ""
while !buffer.match("\r\n\r\n") do
  buffer += socket.read(1)
end

response = socket.read(100) # read first 100 bytes of body
puts response
I'm curious if there is a "ruby way".
This is an old thread, but it's still a question that seems mostly unanswered according to my research. Here's a solution I came up with by monkey-patching Net::HTTP a bit:
require 'net/http'

# provide access to the actual socket
class Net::HTTPResponse
  attr_reader :socket
end

uri = URI("http://www.example.com/path/to/file")
begin
  Net::HTTP.start(uri.host, uri.port) do |http|
    request = Net::HTTP::Get.new(uri.request_uri)
    # calling request with a block prevents the body from being read
    http.request(request) do |response|
      # do whatever limited reading you want to do with the socket
      x = response.socket.read(100)
    end
  end
rescue IOError
  # ignore
end
The rescue catches the IOError that's thrown when you call HTTP.finish prematurely.
FYI, the socket within the HTTPResponse object isn't a true IO object (it's an internal class called BufferedIO), but it's pretty easy to monkey-patch that, too, to mimic the IO methods you need. For example, another library I was using (exifr) needed the readchar method, which was easy to add:
class Net::BufferedIO
  def readchar
    read(1)[0].ord
  end
end
Check out "OpenURI returns two different objects". You might be able to abuse the methods in there to interrupt downloading/throw away the rest of the result after a preset limit.
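Another option worth sketching: if the server honors Range requests, you can ask for just the bytes you want and skip the monkey-patching entirely (the URL is a placeholder; a server that ignores Range will send the full body with a 200 instead of a 206):
require 'net/http'

uri = URI("http://www.example.com/path/to/file")
request = Net::HTTP::Get.new(uri.request_uri)
request['Range'] = 'bytes=0-99' # ask for the first 100 bytes only

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts response.code           # "206" means the server honored the range
puts response.body.bytesize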
