How to download via HTTP only piece of big file with ruby - ruby

I only need to download the first few kilobytes of a file via HTTP.
I tried
require 'open-uri'
url = 'http://example.com/big-file.dat'
file = open(url)
content = file.read(limit)
But it actually downloads the full file.

This seems to work when using sockets:
require 'socket'
host = "download.thinkbroadband.com"
path = "/1GB.zip" # get 1gb sample file
request = "GET #{path} HTTP/1.0\r\n\r\n"
socket = TCPSocket.open(host,80)
socket.print(request)
# find beginning of response body
buffer = ""
while !buffer.match("\r\n\r\n") do
buffer += socket.read(1)
end
response = socket.read(100) #read first 100 bytes of body
puts response
I'm curious if there is a "ruby way".

This is an old thread, but it's still a question that seems mostly unanswered according to my research. Here's a solution I came up with by monkey-patching Net::HTTP a bit:
require 'net/http'
# provide access to the actual socket
class Net::HTTPResponse
attr_reader :socket
end
uri = URI("http://www.example.com/path/to/file")
begin
Net::HTTP.start(uri.host, uri.port) do |http|
request = Net::HTTP::Get.new(uri.request_uri)
# calling request with a block prevents body from being read
http.request(request) do |response|
# do whatever limited reading you want to do with the socket
x = response.socket.read(100);
end
end
rescue IOError
# ignore
end
The rescue catches the IOError that's thrown when you call HTTP.finish prematurely.
FYI, the socket within the HTTPResponse object isn't a true IO object (it's an internal class called BufferedIO), but it's pretty easy to monkey-patch that, too, to mimic the IO methods you need. For example, another library I was using (exifr) needed the readchar method, which was easy to add:
class Net::BufferedIO
def readchar
read(1)[0].ord
end
end

Check out "OpenURI returns two different objects". You might be able to abuse the methods in there to interrupt downloading/throw away the rest of the result after a preset limit.

Related

How to send a request to localhost using 'net/http' fails with end of file reached (EOFError)

I'm using Ruby version 2.3.0.
I want to check when my application is up, and I wrote this method for my "deployer".
At runtime http.request_get(uri) raises
EOFError: end of file reached
when I pass http://localhost as a first argument into the method:
require 'net/http'
def check_start_application(address, port)
success_codes = [200, 301]
attempts = 200
uri = URI.parse("#{address}:#{port}")
http = Net::HTTP.new(uri.host, uri.port)
attempts.times do |attempt|
# it raises EOFError: end of file reached
http.request_get(uri) do |response|
if success_codes.include?(response.code.to_i)
return true
elsif attempt == attempts - 1
return false
end
end
end
end
But, when I test this method separately from a context with irb, this code works pretty well for two cases:
check_start_application('http://example.com', '80')
check_start_application('http://localhost', any_port)
In an app's context this code works for only one case:
check_start_application('http://example.com', '80')
What I tried:
using 'rest-client' instead of 'net/http'
using 'net/https' with http.use_ssl = false
remove times from the method
call sleep before a request
Who faced a similar problem? I believe I'm not the only one.
It may be that on the deploy your app is running on SSL. It's hard to help debug without access to it, but try with:
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true

How to download a binary file via Net::HTTP::Get?

I am trying to download a binary file via HTTP using the following Ruby script.
#!/usr/bin/env ruby
require 'net/http'
require 'uri'
def http_download(resource, filename, debug = false)
uri = URI.parse(resource)
puts "Starting HTTP download for: #{uri}"
http_object = Net::HTTP.new(uri.host, uri.port)
http_object.use_ssl = true if uri.scheme == 'https'
begin
http_object.start do |http|
request = Net::HTTP::Get.new uri.request_uri
Net::HTTP.get_print(uri) if debug
http.read_timeout = 500
http.request request do |response|
open filename, 'w' do |io|
response.read_body do |chunk|
io.write chunk
end
end
end
end
rescue Exception => e
puts "=> Exception: '#{e}'. Skipping download."
return
end
puts "Stored download as #{filename}."
end
However it downloads the HTML source instead of the binary. When I enter the URL in the browser the binary file is downloaded. Here is a URL with which the script fails:
http://dcatlas.dcgis.dc.gov/catalog/download.asp?downloadID=2175&downloadTYPE=KML
I execute the script as follows
pry> require 'myscript'
pry> resource = "http://dcatlas.dcgis.dc.gov/catalog/download.asp?downloadID=2175&downloadTYPE=KML"
pry> http_download(resource,"StreetTreePt.KML", true)
How can I download the binary?
Redirection experiments
I found this redirection check which looks quite reasonable. When I integrate in the response block it fails with the following error:
Exception: 'undefined method `host' for "save_download.asp?filename=StreetTreePt.KML":String'. Skipping download.
The exception does not occur in the "original" function posted above.
The documentation for Net::HTTP shows how to handle redirects:
Following Redirection
Each Net::HTTPResponse object belongs to a class for its response code.
For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass and a 200 response is an instance of the Net::HTTPOK class. For details of response classes, see the section “HTTP Response Classes” below.
Using a case statement you can handle various types of responses properly:
def fetch(uri_str, limit = 10)
# You should choose a better exception.
raise ArgumentError, 'too many HTTP redirects' if limit == 0
response = Net::HTTP.get_response(URI(uri_str))
case response
when Net::HTTPSuccess then
response
when Net::HTTPRedirection then
location = response['location']
warn "redirected to #{location}"
fetch(location, limit - 1)
else
response.value
end
end
print fetch('http://www.ruby-lang.org')
Or, you can use Ruby's OpenURI, which handles it automatically. Or, the Curb gem will do it. Probably Typhoeus and HTTPClient too.
According to the code you show in your question, the exception you are getting can only come from:
http_object = Net::HTTP.new(uri.host, uri.port)
which is hardly likely since uri is a URI object. You need to show the complete code if you want help with that problem.

Adjusting timeouts for Nokogiri connections

Why nokogiri waits for couple of secongs (3-5) when the server is busy and I'm requesting pages one by one, but when these request are in a loop, nokogiri does not wait and throws the timeout message.
I'm using timeout block wrapping the request, but nokogiri does not wait for that time at all.
Any suggested procedure on this?
# this is a method from the eng class
def get_page(url,page_type)
begin
timeout(10) do
# Get a Nokogiri::HTML::Document for the page we’re interested in...
##doc = Nokogiri::HTML(open(url))
end
rescue Timeout::Error
puts "Time out connection request"
raise
end
end
# this is a snippet from the main app calling eng class
# receives a hash with urls and goes throgh asking one by one
def retrieve_in_loop(links)
(0..links.length).each do |idx|
url = links[idx]
puts "Visiting link #{idx} of #{links.length}"
puts "link: #{url}"
begin
##eng.get_page(url, product)
rescue Exception => e
puts "Error getting url: #{idx} #{url}"
puts "This link will be skeeped. Continuing with next one"
end
end
end
The timeout block is simply the max time that that code has to execute inside the block without triggering an exception. It does not affect anything inside Nokogiri or OpenURI.
You can set the timeout to a year, but OpenURI can still time out whenever it likes.
So your problem is most likely that OpenURI is timing out on the connection attempt itself. Nokogiri has no timeouts; it's just a parser.
Adjusting read timeout
The only timeout you can adjust on OpenURI is the read timeout. It seems you cannot change the connection timeout through this method:
open(url, :read_timeout => 10)
Adjusting connection timeout
To adjust the connection timeout you would have to go with Net::HTTP directly instead:
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.open_timeout = 10
http.read_timeout = 10
response = http.get(uri.path)
Nokogiri.parse(response.body)
You can also take a look at some additional discussion here:
Ruby Net::HTTP time out
Increase timeout for Net::HTTP

How does thin asynch app example buffer response to the body?

Hi I have been going through the documentation on Thin and I am reasonably new to eventmachine but I am aware of how Deferrables work. My goal is to understand how Thin works when the body is deferred and streamed part by part.
The following is the example that I'm working with and trying to get my head around.
class DeferrableBody
include EventMachine::Deferrable
def call(body)
body.each do |chunk|
#body_callback.call(chunk)
end
# #body_callback.call()
end
def each &blk
#body_callback = blk
end
end
class AsyncApp
# This is a template async response. N.B. Can't use string for body on 1.9
AsyncResponse = [-1, {}, []].freeze
puts "Aysnc testing #{AsyncResponse.inspect}"
def call(env)
body = DeferrableBody.new
# Get the headers out there asap, let the client know we're alive...
EventMachine::next_tick do
puts "Next tick running....."
env['async.callback'].call [200, {'Content-Type' => 'text/plain'}, body]
end
# Semi-emulate a long db request, instead of a timer, in reality we'd be
# waiting for the response data. Whilst this happens, other connections
# can be serviced.
# This could be any callback based thing though, a deferrable waiting on
# IO data, a db request, an http request, an smtp send, whatever.
EventMachine::add_timer(2) do
puts "Timer started.."
body.call ["Woah, async!\n"]
EventMachine::add_timer(5) {
# This could actually happen any time, you could spawn off to new
# threads, pause as a good looking lady walks by, whatever.
# Just shows off how we can defer chunks of data in the body, you can
# even call this many times.
body.call ["Cheers then!"]
puts "Succeed Called."
body.succeed
}
end
# throw :async # Still works for supporting non-async frameworks...
puts "Async REsponse sent."
AsyncResponse # May end up in Rack :-)
end
end
# The additions to env for async.connection and async.callback absolutely
# destroy the speed of the request if Lint is doing it's checks on env.
# It is also important to note that an async response will not pass through
# any further middleware, as the async response notification has been passed
# right up to the webserver, and the callback goes directly there too.
# Middleware could possibly catch :async, and also provide a different
# async.connection and async.callback.
# use Rack::Lint
run AsyncApp.new
The part which I don't clearly understand is what happens within the DeferrableBody class in the call and the each methods.
I get that the each receives chunks of data once the timer fires as blocks stored in #body_callback and when succeed is called on the body it sends the body but when is yield or call called on those blocks how does it become a single message when sent.
I feel I don't understand closures enough to understand whats happening. Would appreciate any help on this.
Thank you.
Ok I think I figured out how the each blocks works.
Thin on post_init seems to be generating a #request and #response object when the connection comes in. The response object needs to respond to an each method. This is the method we override.
The env['async.callback'] is a closure that is assigned to a method called post_process in the connection.rb class method where the data is actually sent to connection which looks like this
#response.each do |chunk|
trace { chunk }
puts "-- [THIN] sending data #{chunk} ---"
send_data chunk
end
How the response object's each is defined
def each
yield head
if #body.is_a?(String)
yield #body
else
#body.each { |chunk| yield chunk }
end
end
So our env['async.callback'] is basically a method called post_process defined in the connection.rb class accessed via method(:post_process) allowing our method to be handled like a closure, which contains access to the #response object. When the reactor starts it first sends the header data in the next_tick where it yields the head, but the body is empty at this point so nothing gets yielded.
After this our each method overrides the old implementation owned by the #response object so when the add_timers fire the post_process which gets triggered sends the data that we supply using the body.call(["Wooah..."]) to the browser (or wherever)
Completely in awe of macournoyer and the team committing to thin. Please correct my understanding if you feel this is not how it works.

Download a zip file through Net::HTTP

I am trying to download the latest.zip from WordPress.org using Net::HTTP. This is what I have got so far:
Net::HTTP.start("wordpress.org/") { |http|
resp = http.get("latest.zip")
open("a.zip", "wb") { |file|
file.write(resp.body)
}
puts "WordPress downloaded"
}
But this only gives me a 4 kilobytes 404 error HTML-page (if I change file to a.txt). I am thinking this has something to do with the URL probably is redirected somehow but I have no clue what I am doing. I am a newbie to Ruby.
My first question is why use Net::HTTP, or code to download something that could be done more easily using curl or wget, which are designed to make it easy to download files?
But, since you want to download things using code, I'd recommend looking at Open-URI if you want to follow redirects. Its a standard library for Ruby, and very useful for fast HTTP/FTP access to pages and files:
require 'open-uri'
open('latest.zip', 'wb') do |fo|
fo.print open('http://wordpress.org/latest.zip').read
end
I just ran that, waited a few seconds for it to finish, ran unzip against the downloaded file "latest.zip", and it expanded into the directory containing their content.
Beyond Open-URI, there's HTTPClient and Typhoeus, among others, that make it easy to open an HTTP connection and send queriers/receive data. They're very powerful and worth getting to know.
NET::HTTP doesn't provide a nice way of following redirects, here is a piece of code that I've been using for a while now:
require 'net/http'
class RedirectFollower
class TooManyRedirects < StandardError; end
attr_accessor :url, :body, :redirect_limit, :response
def initialize(url, limit=5)
#url, #redirect_limit = url, limit
end
def resolve
raise TooManyRedirects if redirect_limit < 0
self.response = Net::HTTP.get_response(URI.parse(url))
if response.kind_of?(Net::HTTPRedirection)
self.url = redirect_url
self.redirect_limit -= 1
resolve
end
self.body = response.body
self
end
def redirect_url
if response['location'].nil?
response.body.match(/<a href=\"([^>]+)\">/i)[1]
else
response['location']
end
end
end
wordpress = RedirectFollower.new('http://wordpress.org/latest.zip').resolve
puts wordpress.url
File.open("latest.zip", "w") do |file|
file.write wordpress.body
end

Resources