How to get HTTP headers before downloading with Ruby's OpenURI

I am currently using OpenURI to download a file in Ruby. Unfortunately, it seems impossible to get the HTTP headers without downloading the full file:
require 'open-uri'
require 'ruby-progressbar'

pbar = nil # declared here so both procs see the same variable
open(base_url,
  :content_length_proc => lambda { |t|
    if t && 0 < t
      pbar = ProgressBar.create(:total => t)
    end
  },
  :progress_proc => lambda { |s|
    pbar.progress = s if pbar
  }) { |io|
  puts io.size
  puts io.meta['content-disposition']
}
Running the code above shows that it first downloads the full file and only then prints the header I need.
Is there a way to get the headers before the full file is downloaded, so I can cancel the download if the headers are not what I expect them to be?

You can use Net::HTTP for this, for example:
require 'net/http'
http = Net::HTTP.start('stackoverflow.com')
resp = http.head('/')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish
Another example, this time getting the headers of the wonderful book, Object-Oriented Programming with ANSI-C:
require 'net/http'
http = Net::HTTP.start('www.planetpdf.com')
resp = http.head('/codecuts/pdfs/ooc.pdf')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish

It seems what I wanted is not possible to achieve using OpenURI; at least not, as I said, without loading the whole file first.
I was able to do what I wanted using Net::HTTP's request_get.
Here's an example:
http.request_get('/largefile.jpg') { |response|
  if response['content-length'].to_i < max_length
    response.read_body do |str| # read body now
      # save to file
    end
  end
}
Note that this only works when using a block. If you call it without one, like this:
response = http.request_get('/largefile.jpg')
the body will already have been read.
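For instance, here's a minimal sketch of the whole flow (the host, path, and size cap are placeholders), aborting before the body transfer when the file is too big:
require 'net/http'

MAX_BYTES = 1_000_000 # arbitrary cap for this example

Net::HTTP.start('example.org') do |http|
  http.request_get('/largefile.jpg') do |response|
    # Headers are available here, but the body has not been read yet.
    if response['content-length'].to_i > MAX_BYTES
      puts 'Too large, skipping download'
    else
      File.open('largefile.jpg', 'wb') do |file|
        response.read_body { |chunk| file.write(chunk) }
      end
    end
  end
end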

Rather than use Net::HTTP, which can be like digging a pool on the beach using a sand shovel, you can use one of the many HTTP client gems for Ruby and clean up the code.
Here's a sample using HTTParty:
require 'httparty'
resp = HTTParty.head('http://example.org')
resp.headers
# => {"accept-ranges"=>["bytes"], "cache-control"=>["max-age=604800"], "content-type"=>["text/html"], "date"=>["Thu, 02 Mar 2017 18:52:42 GMT"], "etag"=>["\"359670651\""], "expires"=>["Thu, 09 Mar 2017 18:52:42 GMT"], "last-modified"=>["Fri, 09 Aug 2013 23:54:35 GMT"], "server"=>["ECS (oxr/83AB)"], "x-cache"=>["HIT"], "content-length"=>["1270"], "connection"=>["close"]}
At that point it's easy to check the size of the document:
resp.headers['content-length'] # => "1270"
Unfortunately, the HTTPd you're talking to might not know how big the content will be. In order to respond quickly, servers don't necessarily calculate the size of dynamically generated output, which would take almost as long and be almost as CPU-intensive as actually sending it, so relying on the "content-length" value can be unreliable.
The issue with Net::HTTP is that it won't automatically handle redirects, so you have to add additional code. Granted, that code is supplied in the documentation, but it keeps growing as you need to do more things, until you've ended up writing yet another HTTP client (YAHC). So avoid that and use an existing wheel.
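To give a sense of scale, here's a sketch of the redirect-following you end up writing yourself with Net::HTTP (the five-hop limit is arbitrary):
require 'net/http'
require 'uri'

def head_following_redirects(url, limit = 5)
  raise 'too many redirects' if limit.zero?
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  if response.is_a?(Net::HTTPRedirection)
    # follow the Location header and try again
    head_following_redirects(response['location'], limit - 1)
  else
    response
  end
end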

Related

Serve HTML files stored on S3 with a Rack app

Say I have some HTML documents stored on S3 like this:
http://alan.aws-s3-bla-bla.com/posts/1.html
http://alan.aws-s3-bla-bla.com/posts/2.html
http://alan.aws-s3-bla-bla.com/posts/3.html
http://alan.aws-s3-bla-bla.com/posts/1/comments/1.html
http://alan.aws-s3-bla-bla.com/posts/1/comments/2.html
http://alan.aws-s3-bla-bla.com/posts/1/comments/3.html
etc, etc
I'd like to serve these with a Rack (preferably Sinatra) application, mapping the following routes:
get "/posts/:id" do
render "http://alan.aws-s3-bla-bla.com/posts/#{params[:id]}.html"
end
get "/posts/:posts_id/comments/:comments_id" do
render "http://alan.aws-s3-bla-bla.com/posts/#{params[:posts_id]}/comments/#{params[:comments_id}.html"
end
Is this a good idea? How would I do it?
There would obviously be a wait while you grabbed the file, so you could cache it or set ETags etc. to help with that. Whether it's worth storing the HTML locally rather than fetching it remotely depends on how long you're willing to wait, how often it's accessed, its size, and so on. Only you can work that bit out.
If the last expression in the block is a string, it will automatically be rendered, so there's no need to call render as long as you've opened the file as a string.
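For example, a minimal sketch using open-uri (the S3 URL pattern is the one from the question):
require 'sinatra'
require 'open-uri'

get "/posts/:id" do
  # The fetched body is a string, and Sinatra renders the block's last
  # expression, so no explicit render call is needed.
  open("http://alan.aws-s3-bla-bla.com/posts/#{params[:id]}.html").read
end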
Here's how to grab an external file and put it into a tempfile:
require 'faraday'
require 'faraday_middleware'
require 'tempfile'
#require 'faraday/adapter/typhoeus' # see https://github.com/typhoeus/typhoeus/issues/226#issuecomment-9919517 if you get a problem with the requiring
require 'typhoeus/adapters/faraday'

configure do
  Faraday.default_connection = Faraday::Connection.new(
    :headers => { :accept => 'text/plain', # maybe this is wrong
                  :user_agent => "Sinatra via Faraday" }
  ) do |conn|
    conn.use Faraday::Adapter::Typhoeus
  end
end

helpers do
  def grab_external_html(url)
    response = Faraday.get url # you'll need to supply this variable somehow, your choice
    # Tempfile basenames can't contain path separators, so don't use the URL directly
    tempfile = Tempfile.new('grabbed_html')
    tempfile.binmode
    tempfile.write(response.body)
    tempfile.rewind
    tempfile
  end
end

get "/posts/:whatever/" do
  tempfile = grab_external_html params[:whatever] # surely you'd do a bit more here…
  tempfile.read
end
This might work. You may also want to think about closing that tempfile, but the garbage collector and the OS should take care of it.
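If you'd rather not wait for the GC, here's a small sketch of closing and unlinking it explicitly:
get "/posts/:whatever/" do
  tempfile = grab_external_html params[:whatever]
  begin
    tempfile.read
  ensure
    tempfile.close
    tempfile.unlink # delete the file now rather than at GC time
  end
end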

Why doesn't Nokogiri load the full page?

I'm using Nokogiri to open Wikipedia pages about various countries, and then extracting the names of these countries in other languages from the interwiki links (links to foreign-language wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, it doesn't contain the interwiki links that I need. How can I force it to download the whole page?
Here's my code:
url = "http://en.wikipedia.org/wiki/" + country_name
page = nil
begin
page = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError=>e
puts "No article found for " + country_name
end
language_part = page.css('div#p-lang')
Test:
with country_name = "France"
=> []
with country_name = "Thailand"
=> really long array that I don't want to quote here,
but containing all the right data
Maybe this issue goes beyond Nokogiri and into OpenURI - anyway I need to find a solution.
Nokogiri does not retrieve the page; it asks OpenURI to do it, with an internal read on the StringIO object that OpenURI returns.
require 'open-uri'
require 'zlib'

stream = open('http://en.wikipedia.org/wiki/France')
if stream.content_encoding.empty?
  body = stream.read
else
  body = Zlib::GzipReader.new(stream).read
end
p body
Here's what you can key off of:
>> require 'open-uri' #=> true
>> open('http://en.wikipedia.org/wiki/France').content_encoding #=> ["gzip"]
>> open('http://en.wikipedia.org/wiki/Thailand').content_encoding #=> []
In this case, if it's [], AKA "text/html", it reads. If it's ["gzip"] it decodes.
Doing all the stuff above and tossing it to:
require 'nokogiri'
page = Nokogiri::HTML(body)
language_part = page.css('div#p-lang')
should get you back on track.
Do this after all the above to confirm visually you're getting something usable:
p language_part.text.gsub("\t", '')
See Casper's answer and comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.
After quite a bit of head scratching, the problem is here:
> wget -S 'http://en.wikipedia.org/wiki/France'
Resolving en.wikipedia.org... 91.198.174.232
Connecting to en.wikipedia.org|91.198.174.232|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Content-Language: en
Last-Modified: Fri, 01 Jul 2011 23:31:36 GMT
Content-Encoding: gzip <<<<------ BINGO!
...
You need to unpack the gzipped data, which open-uri does not do automatically.
Solution:
require 'net/http'
require 'stringio'
require 'zlib'

def http_get(uri)
  url = URI.parse uri
  res = Net::HTTP.start(url.host, url.port) { |h|
    h.get(url.path)
  }
  headers = res.to_hash
  gzipped = headers['content-encoding'] && headers['content-encoding'][0] == "gzip"
  gzipped ? Zlib::GzipReader.new(StringIO.new(res.body)).read : res.body
end
And then:
page = Nokogiri::HTML(http_get("http://en.wikipedia.org/wiki/France"))
Alternatively, here's a helper in the same spirit that asks for gzip explicitly and unwraps it transparently (a sketch; the name open_gzip_aware is arbitrary):
require 'open-uri'
require 'zlib'

def open_gzip_aware(url)
  response = open(url, 'Accept-Encoding' => 'gzip, deflate')
  if response.content_encoding.include?('gzip')
    response = Zlib::GzipReader.new(response)
    # delegate the open-uri meta methods to the underlying stream
    response.define_singleton_method(:method_missing) do |name, *args|
      to_io.public_send(name, *args)
    end
  end
  yield response if block_given?
  response
end

Using Open-URI to fetch XML, and the best practice in case of problems with a remote URL not returning/timing out?

Current code works as long as there is no remote error:
def get_name_from_remote_url
  cstr = "http://someurl.com"
  getresult = open(cstr, "UserAgent" => "Ruby-OpenURI").read
  doc = Nokogiri::XML(getresult)
  my_data = doc.xpath("/session/name").text
  # => 'Fred' or 'Sam' etc
  return my_data
end
But what if the remote URL times out or returns nothing? How do I detect that and return nil, for example?
And does Open-URI give a way to define how long to wait before giving up? This method is called while a user is waiting for a response, so how do we set a maximum timeout before we give up and tell the user "sorry, the remote server we tried to access is not available right now"?
Open-URI is convenient, but that ease of use comes at a price: it hides a lot of the configuration details that other HTTP clients like Net::HTTP expose.
It depends on what version of Ruby you're using. For 1.8.7 you can use the Timeout module. From the docs:
require 'open-uri'
require 'timeout'

getresult = nil # declared outside the block so it's visible afterwards
begin
  status = Timeout::timeout(5) {
    getresult = open(cstr, "UserAgent" => "Ruby-OpenURI").read
  }
rescue Timeout::Error => e
  puts e.to_s
end
Then check the length of getresult to see if you got any content:
if getresult.nil? || getresult.empty?
  puts "got nothing from url"
end
If you are using Ruby 1.9.2 you can add a :read_timeout => 10 option to the open() method.
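For example (a sketch; the exact exception raised on timeout varies by Ruby version, so rescue broadly):
require 'open-uri'

begin
  getresult = open(cstr, 'UserAgent' => 'Ruby-OpenURI', :read_timeout => 10).read
rescue Timeout::Error, OpenURI::HTTPError => e
  getresult = nil
end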
Also, your code could be tightened up and made a bit more flexible. This will let you pass in a URL or default to the currently used URL. Also read Nokogiri's NodeSet docs to understand the difference between xpath, /, css and at, %, at_css, at_xpath:
def get_name_from_remote_url(cstr = 'http://someurl.com')
  doc = Nokogiri::XML(open(cstr, 'UserAgent' => 'Ruby-OpenURI'))
  # xpath returns a NodeSet, which has to be iterated over
  # my_data = doc.xpath('/session/name').text # => 'Fred' or 'Sam' etc.
  # at returns a single node
  doc.at('/session/name').text
end
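And to fold the timeout handling in and return nil on failure, as the question asked, a minimal sketch (the rescue list is an assumption; adjust it for your Ruby version):
require 'nokogiri'
require 'open-uri'

def get_name_from_remote_url(cstr = 'http://someurl.com')
  doc = Nokogiri::XML(open(cstr, 'UserAgent' => 'Ruby-OpenURI', :read_timeout => 10))
  doc.at('/session/name').text
rescue OpenURI::HTTPError, Timeout::Error
  nil # caller sees nil on timeout or HTTP error
end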

How to parse SOAP response from ruby client?

I am learning Ruby and I have written the following code to find out how to consume SOAP services:
require 'soap/wsdlDriver'
wsdl="http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service=SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
weather=service.getTodaysBirthdays('1/26/2010')
The response that I get back is:
#<SOAP::Mapping::Object:0x80ac3714
{http://www.abundanttech.com/webservices/deadoralive} getTodaysBirthdaysResult=#<SOAP::Mapping::Object:0x80ac34a8
{http://www.w3.org/2001/XMLSchema}schema=#<SOAP::Mapping::Object:0x80ac3214
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2f6c
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac2cc4
{http://www.w3.org/2001/XMLSchema}choice=#<SOAP::Mapping::Object:0x80ac2a1c
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2774
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac24cc
{http://www.w3.org/2001/XMLSchema}sequence=#<SOAP::Mapping::Object:0x80ac2224
{http://www.w3.org/2001/XMLSchema}element=[#<SOAP::Mapping::Object:0x80ac1f7c>,
#<SOAP::Mapping::Object:0x80ac13ec>,
#<SOAP::Mapping::Object:0x80ac0a28>,
#<SOAP::Mapping::Object:0x80ac0078>,
#<SOAP::Mapping::Object:0x80abf6c8>,
#<SOAP::Mapping::Object:0x80abed18>]
>>>>>>> {urn:schemas-microsoft-com:xml-diffgram-v1}diffgram=#<SOAP::Mapping::Object:0x80abe6c4
{}NewDataSet=#<SOAP::Mapping::Object:0x80ac1220
{}Table=[#<SOAP::Mapping::Object:0x80ac75e4
{}FullName="Cully, Zara"
{}BirthDate="01/26/1892"
{}DeathDate="02/28/1979"
{}Age="(87)"
{}KnownFor="The Jeffersons"
{}DeadOrAlive="Dead">,
#<SOAP::Mapping::Object:0x80b778f4
{}FullName="Feiffer, Jules"
{}BirthDate="01/26/1929"
{}DeathDate=#<SOAP::Mapping::Object:0x80c7eaf4>
{}Age="81"
{}KnownFor="Cartoonists"
{}DeadOrAlive="Alive">]>>>>
I am having a great deal of difficulty figuring out how to parse and show the returned information in a nice table, or even just how to loop through the records and have access to each element (i.e. FullName, Age, etc.). I went through the whole "getTodaysBirthdaysResult.methods - Object.new.methods" exercise and kept working down to try to work out how to access the elements, but when I got to the array I got lost.
Any help that can be offered would be appreciated.
If you're going to parse the XML anyway, you might as well skip SOAP4r and go with Handsoap. Disclaimer: I'm one of the authors of Handsoap.
An example implementation:
# wsdl: http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl
DEADORALIVE_SERVICE_ENDPOINT = {
  :uri => 'http://www.abundanttech.com/WebServices/DeadOrAlive/DeadOrAlive.asmx',
  :version => 1
}

class DeadoraliveService < Handsoap::Service
  endpoint DEADORALIVE_SERVICE_ENDPOINT

  def on_create_document(doc)
    # register namespaces for the request
    doc.alias 'tns', 'http://www.abundanttech.com/webservices/deadoralive'
  end

  def on_response_document(doc)
    # register namespaces for the response
    doc.add_namespace 'ns', 'http://www.abundanttech.com/webservices/deadoralive'
  end

  # public methods
  def get_todays_birthdays
    soap_action = 'http://www.abundanttech.com/webservices/deadoralive/getTodaysBirthdays'
    response = invoke('tns:getTodaysBirthdays', soap_action)
    (response/"//NewDataSet/Table").map do |table|
      {
        :full_name  => (table/"FullName").to_s,
        :birth_date => Date.strptime((table/"BirthDate").to_s, "%m/%d/%Y"),
        :death_date => Date.strptime((table/"DeathDate").to_s, "%m/%d/%Y"),
        :age        => (table/"Age").to_s.gsub(/^\(([\d]+)\)$/, '\1').to_i,
        :known_for  => (table/"KnownFor").to_s,
        :alive?     => (table/"DeadOrAlive").to_s == "Alive"
      }
    end
  end
end
Usage:
DeadoraliveService.get_todays_birthdays
SOAP4R always returns a SOAP::Mapping::Object, which is sometimes a bit difficult to work with unless you are just getting hash values that you can access using hash notation, like so:
weather['fullName']
However, that does not work when you have an array of hashes. A workaround is to get the result as XML instead of a SOAP::Mapping::Object. To do that, I would modify your code as follows:
require 'soap/wsdlDriver'

wsdl = "http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service = SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
service.return_response_as_xml = true
weather = service.getTodaysBirthdays('1/26/2010')
The above will now give you an XML response, which you can parse using Nokogiri or REXML. Here is an example using REXML:
require 'rexml/document'
rexml = REXML::Document.new(weather)
birthdays = nil
rexml.each_recursive {|element| birthdays = element if element.name == 'getTodaysBirthdaysResult'}
birthdays.each_recursive{|element| puts "#{element.name} = #{element.text}" if element.text}
This will print out all elements that have any text.
So once you have created an XML document, you can do pretty much anything, depending on the methods of the library you choose, i.e. REXML or Nokogiri.
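For instance, a sketch with Nokogiri that turns each Table element from the response above into a hash (remove_namespaces! keeps the XPath simple):
require 'nokogiri'

doc = Nokogiri::XML(weather)
doc.remove_namespaces!
birthdays = doc.xpath('//NewDataSet/Table').map do |table|
  {
    :full_name     => table.at('FullName').text,
    :age           => table.at('Age').text,
    :dead_or_alive => table.at('DeadOrAlive').text
  }
end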
Well, here's my suggestion.
The issue is that you have to snag the right part of the result, something you can actually iterate over. Unfortunately, all the inspecting in the world won't help you, because it's a huge blob of unreadable text.
What I do is this:
File.open('myresult.yaml', 'w') {|f| f.write(result.to_yaml) }
This will be a much more human readable format. What you are probably looking for is something like this:
--- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id024 !ruby/object:XSD::QName
name: ListAddressBooksResult <-- Hash name, so it's result["ListAddressBooksResult"]
namespace: http://apiconnector.com
source:
- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id023 !ruby/object:XSD::QName
name: APIAddressBook <-- this bastard is enumerable :) YAY! so it's result["ListAddressBooksResult"]["APIAddressBook"].each
namespace: http://apiconnector.com
source:
- - !ruby/object:SOAP::Mapping::Object
That is a result from DotMailer's API, which I spent the last hour trying to figure out how to enumerate over; the YAML dump above is the technique I used to figure out what the heck was going on. I think it beats using REXML etc. This way, I could do something like this:
result['ListAddressBooksResult']['APIAddressBook'].each {|book| puts book["Name"]}
Well, I hope this helps anyone else who is looking.
/jason

How to make an HTTP GET with modified headers?

What is the best way to make an HTTP GET request in Ruby with modified headers?
I want to get a range of bytes from the end of a log file and have been toying with the following code, but the server is throwing back a response saying that "it is a request that the server could not understand" (the server is Apache).
require 'net/http'
require 'uri'

# with @address, @port, @path all defined elsewhere
httpcall = Net::HTTP.new(@address, @port)
headers = {
  'Range' => 'bytes=1000-'
}
resp, data = httpcall.get2(@path, headers)
Is there a better way to define headers in Ruby?
Does anyone know why this would be failing against Apache? If I do a GET in a browser to http://[address]:[port]/[path] I get the data I am seeking without issue.
I created a solution that worked for me (very well, in fact). This example gets a range offset:
require 'uri'
require 'net/http'

size = 1000 # the last offset (for the Range header)
uri = URI("http://localhost:80/index.html")
http = Net::HTTP.new(uri.host, uri.port)
headers = {
  'Range' => "bytes=#{size}-"
}
path = uri.path.empty? ? "/" : uri.path

# test to ensure that the request will be valid - first get the head
code = http.head(path, headers).code.to_i
if code >= 200 && code < 300
  # the data is available...
  http.get(path, headers) do |chunk|
    # provided the data is good, print it...
    print chunk unless chunk =~ />416.+Range/
  end
end
If you have access to the server logs, try comparing the request from the browser with the one from Ruby and see if that tells you anything. If this isn't practical, fire up WEBrick as a mock of the file server. Don't worry about the results, just compare the requests to see what they are doing differently.
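A quick WEBrick mock is only a few lines (a sketch; it just echoes whatever request headers arrive):
require 'webrick'

server = WEBrick::HTTPServer.new(:Port => 8000)
server.mount_proc('/') do |req, res|
  # Dump the raw request so you can diff the browser's against Ruby's.
  puts req.request_line
  puts req.raw_header
  res.body = 'ok'
end
trap('INT') { server.shutdown }
server.start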
As for Ruby style, you could move the headers inline, like so:
httpcall = Net::HTTP.new(@address, @port)
resp, data = httpcall.get2(@path, 'Range' => 'bytes=1000-')
Also, note that in Ruby 1.8+, what you are almost certainly running, Net::HTTP#get2 returns a single HTTPResponse object, not a resp, data pair.
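So the call above really just needs one receiver (a quick sketch):
resp = httpcall.get2(@path, 'Range' => 'bytes=1000-')
puts resp.code        # e.g. "206" for a successful partial-content response
puts resp.body.length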
